Abstract

We extend the theory of d-separation to cases in which data instances are not independent
and identically distributed. We show that applying the rules of d-separation directly
to the structure of probabilistic models of relational data inaccurately infers conditional
independence. We introduce relational d-separation, a theory for deriving conditional
independence facts from relational models. We provide a new representation, the abstract
ground graph, that enables a sound, complete, and computationally efficient method for
answering d-separation queries about relational models, and we present empirical results
that demonstrate effectiveness.

Keywords: relational models, d-separation, conditional independence, lifted representations,
directed graphical models


1.  Introduction 

The  rules  of  d-separation  can  algorithmically  derive  all  conditional  independence  facts  that 
hold  in  distributions  represented  by  a  Bayesian  network.  In  this  paper,  we  show  that 
d-separation  may  not  correctly  infer  conditional  independence  when  applied  directly  to 
the  graphical  structure  of  a  relational  model.  We  introduce  the  notion  of  relational  d- 
separation — a  graphical  criterion  for  deriving  conditional  independence  facts  from  relational 
models — and  define  its  semantics  to  be  consistent  with  traditional  d-separation  (i.e.,  it 
claims  independence  only  when  it  is  guaranteed  to  hold  for  all  model  instantiations).  We 
present  an  alternative,  lifted  representation — the  abstract  ground  graph — that  enables  an 
algorithm  for  deriving  conditional  independence  facts  from  relational  models.  We  show  that 
this  approach  is  sound,  complete,  and  computationally  efficient,  and  we  provide  an  empirical 
demonstration  of  effectiveness  across  synthetic  causal  structures  of  relational  domains. 

The  main  contributions  of  this  work  are: 

• A precise formalization of fundamental concepts of relational data and relational models
necessary to reason about conditional independence (Section 4)


©2013  Marc  Maier  and  Katerina  Marazopoulou  and  David  Jensen. 


Maier,  Marazopoulou,  and  Jensen 


• A formal definition of d-separation for relational models analogous to d-separation
for Bayesian networks (Section 5)

• The abstract ground graph, a lifted representation that abstracts all possible ground
graphs of a given relational model structure, as well as proofs of the soundness and
completeness of this abstraction (Section 5.1)

• Proofs of soundness and completeness for a method that answers relational d-separation
queries (Section 5.2)

We also provide an empirical comparison of relational d-separation to traditional
d-separation applied directly to relational model structure, showing that, not only would
most queries be undefined, but those that can be represented yield an incorrect judgment
of conditional independence up to 50% of the time (Section 6). Finally, we offer additional
empirical results on synthetic data that demonstrate the effectiveness of relational
d-separation with respect to complexity and consistency (Section 7). The remainder of this
introductory section first gives a brief overview of Bayesian networks and their generalization
to relational models and then describes why d-separation is a useful theory.


1.1  From  Bayesian  Networks  to  Relational  Models 


Bayesian  networks  are  a  widely  used  class  of  graphical  models  that  are  capable  of  compactly 
representing  a  joint  probability  distribution  over  a  set  of  variables.  The  joint  distribution 
can  be  factored  into  a  product  of  conditional  distributions  by  assuming  that  variables  are 
independent  of  their  non-descendants  given  their  parents  (the  Markov  condition).  The 
Markov  condition  ties  the  structure  of  the  model  to  the  set  of  conditional  independencies 
that  hold  over  all  probability  distributions  the  model  can  represent.  Accurate  reasoning 
about such conditional independence facts is the basis for constraint-based algorithms, such
as PC and FCI (Spirtes et al., 2000), and hybrid approaches, such as MMHC (Tsamardinos
et al., 2006), that are commonly used to learn the structure of Bayesian networks. Under a
small number of assumptions and with knowledge of the conditional independencies, these
algorithms can identify causal structure (Pearl, 2000; Spirtes et al., 2000).


Deriving the full set of conditional independencies implied by the Markov condition is
complex, requiring manipulation of the joint distribution and various probability axioms.
Fortunately, the exact same set of conditional independencies entailed by the Markov
condition is also entailed by d-separation, a set of graphical rules that algorithmically derive
conditional independence facts directly from the graphical structure of the model. The
Markov condition and d-separation have been shown to be equivalent approaches for producing
conditional independence from Bayesian networks (Verma and Pearl, 1988; Geiger and Pearl,
1988; Neapolitan, 2004). When interpreting a Bayesian network causally, the causal Markov
condition (variables are independent of their non-effects given their direct causes) and
d-separation have been shown to provide the correct connection between causal structure and
conditional independence (Scheines, 1997).


Bayesian networks assume that data instances are independent and identically distributed,
but many real-world systems are characterized by interacting heterogeneous entities.
For example, social network data consist of individuals, groups, and their relationships;
citation data involve researchers collaborating and authoring scholarly papers that cite prior
work; and sports data include players, coaches, teams, referees, and their competitive
interactions. Over the past 15 years, researchers in statistics and computer science have devised
more expressive classes of directed graphical models, such as probabilistic relational models
(PRMs), which remove the assumptions of independent and identically distributed instances
to more accurately describe these types of domains (Getoor and Taskar, 2007). Relational
models generalize other classes of models that incorporate interference, spillover effects, or
violations of the stable unit treatment value assumption (SUTVA) (Hudgens and Halloran,
2008; Tchetgen Tchetgen and VanderWeele, 2012) and multilevel or hierarchical models
(Gelman and Hill, 2007).


Many practical applications have also benefited from learning and reasoning with relational
models. Examples include analysis of gene regulatory interactions (Segal et al., 2001),
scholarly citations (Taskar et al., 2001), ecosystems (D'Ambrosio et al., 2003), biological
cellular networks (Friedman, 2004), epidemiology (Getoor et al., 2004), and security in
information systems (Sommestad et al., 2010). The structure and parameters of these models
can be learned from a relational data set. Typically, either the model is used to predict
values of certain attributes (e.g., topics of papers) or its structure is examined directly
(e.g., to determine predictors of disease spread). A major goal in many of these applications
is to promote understanding of a domain or to determine causes of various outcomes.
However, as with Bayesian networks, to effectively interpret and reason about relational
models causally, it is necessary to understand their conditional independence implications.


1.2 Why d-Separation Is Useful

A  Bayesian  network,  as  a  model  of  a  joint  probability  distribution,  enables  a  wide  array  of 
useful  tasks  by  supporting  inference  over  an  entire  set  of  variables.  Bayesian  networks  have 
been  successfully  applied  to  model  many  domains,  ranging  from  bioinformatics  and  medicine 
to  computer  vision  and  information  retrieval.  Naively  specifying  a  joint  distribution  by  hand 
requires  an  exponential  number  of  states;  however,  Bayesian  networks  leverage  the  Markov 
condition  to  factor  a  joint  probability  distribution  into  a  compact  product  of  conditional 
probability  distributions. 
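As a rough illustration of this compression, the sketch below counts free parameters for binary variables: an unrestricted joint table versus a factorization with one conditional probability table per variable. The five-variable structure and its parent sets are hypothetical and chosen only for illustration; they do not come from this paper.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n_vars - 1


def factored_params(parents: dict[str, list[str]]) -> int:
    """Free parameters when each binary variable needs one number
    per configuration of its parents (the Markov factorization)."""
    return sum(2 ** len(ps) for ps in parents.values())


# Hypothetical DAG (an assumption for illustration): A -> C <- B, C -> D, C -> E
example_parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}

joint = full_joint_params(len(example_parents))
factored = factored_params(example_parents)
print(f"full joint: {joint} parameters, factored: {factored} parameters")
```

The gap widens quickly as variables are added, provided the number of parents per variable stays small.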

The  theory  of  d-separation  is  an  alternative  to  the  Markov  condition  that  provides 
equivalent  implications.  It  provides  an  algorithmic  framework  for  deriving  the  conditional 
independencies  encoded  by  the  model.  These  conditional  independence  facts  are  guaranteed 
to  hold  in  every  joint  distribution  the  model  represents  and,  consequently,  in  any  data 
instance  sampled  from  those  distributions.  The  semantics  of  holding  across  all  distributions 
is  the  main  reason  why  d-separation  is  useful,  enabling  two  large  classes  of  applications: 

(1) Identification of causal effects: The theory of d-separation connects the causal structure
encoded by a Bayesian network to the set of probability distributions it can represent.
On this basis, many researchers have developed accompanying theory that describes the
conditions under which certain causal effects are identifiable (uniquely known) and algorithms
for deriving those quantities from the joint distribution. This work enables sound
and complete identification of causal effects, not only with respect to conditioning, but also
under counterfactuals and interventions (via the do-calculus introduced by Pearl, 2000)
and in the presence of latent variables (Tian and Pearl, 2002; Huang and Valtorta, 2006;
Shpitser and Pearl, 2008).

(2) Constraint-based causal discovery algorithms: Causal discovery, the task of learning
generative models of observational data, superficially appears to be a futile endeavor.
Yet learning and reasoning about the causal structure that underlies real domains is a
primary goal for many researchers. Fortunately, d-separation offers a connection between
causal structure and conditional independence. The theory of d-separation can be leveraged
to constrain the hypothesis space by eliminating models that are inconsistent with
observed conditional independence facts. While many distributions do not lead to uniquely
identifiable models, this approach (under simple assumptions) frequently discovers useful
causal knowledge for domains that can be represented as a Bayesian network. This approach
to learning causal structure is referred to as the constraint-based paradigm, and
many algorithms that follow this approach have been developed over the past 20 years,
including Inductive Causation (IC) (Pearl and Verma, 1991), PC (Spirtes et al., 2000) and
its variants, Three Phase Dependency Analysis (TPDA) (Cheng et al., 1997), Grow-Shrink
(Margaritis and Thrun, 1999), Total Conditioning (TC) (Pellet and Elisseeff, 2008), Recursive
Autonomy Identification (RAI) (Yehezkel and Lerner, 2009), and hybrid methods that
partially employ this approach, including Max-Min Hill Climbing (MMHC) (Tsamardinos
et al., 2006) and Hybrid HPC (H2PC) (Gasse et al., 2012).


As  described  above,  relational  models  more  closely  represent  the  real-world  domains  that 
many  social  scientists  and  other  researchers  investigate.  To  successfully  learn  causal  models 
from  observational  data  of  relational  domains,  we  need  a  theory  for  deriving  conditional 
independence  from  relational  models.  In  this  paper,  we  formalize  the  theory  of  relational 
d-separation  and  provide  a  method  for  deriving  conditional  independence  facts  from  the 
structure of a relational model. In another paper, we have used these results to provide
a theoretical framework for a sound and complete constraint-based algorithm, the Relational
Causal Discovery (RCD) algorithm (Maier et al., 2013), that learns causal models
of relational domains.


2.  Example 

Consider a corporate analyst who was hired to identify which employees are effective and
productive for some organization. If the company is structured as a pure project-based
organization (for which company personnel are structured around projects, not departments),
the analyst may collect data as described by the relational schema in Figure 2.1(a). The
schema denotes that employees can collaborate and work on multiple products, each of
which is funded by a specific business unit. The analyst has also obtained attributes on
each entity: salary and competence of employees, the success of each product, and the
budget and revenue of business units. In this example, the organization consists of five
employees, five products, and two business units, which are shown in the relational skeleton
in Figure 2.1(b).

Assume that the organization operates under the model depicted in Figure 2.2(a). For
example, the success of a product depends on the competence of employees that develop it,
and the revenue of a business unit is influenced by the success of products that it funds. If
this were known by the analyst (who happens to have experience in graphical models), then
it would be conceivable to spot-check the model and test whether some of the conditional
independencies encoded by the model are reflected in the data. The analyst then naively
applies d-separation to the model structure in an attempt to derive conditional independencies
to test. However, applying d-separation directly to the structure of relational models
may not correctly derive conditional independencies, violating the Markov condition. If
the analyst were to discover significant and substantive effects, he may believe the model
structure is incorrect and needlessly search for alternative dependencies.

[Figure 2.1(a): ER diagram with entity classes EMPLOYEE (Salary, Competence), PRODUCT
(Success), and BUSINESS-UNIT (Budget, Revenue), linked by relationship classes DEVELOPS
and FUNDS.]

(a) Example relational schema for an organization consisting of employees working on products, which are
funded by specific business units within a corporation.

(b) Example fragment of a relational skeleton. Roger and Sally are employees, both of whom develop the
Laptop product, but, of the two, only Sally works on product Tablet. Both products Laptop and Tablet are
funded by business unit Devices. For convenience, we depict attribute placeholders on each entity instance.

Figure 2.1: An example relational schema and skeleton for the organization domain.

Naively applying d-separation to the model in Figure 2.2(a) suggests that employee
competence is conditionally independent of the revenue of business units given the success
of products:

$\text{Employee.Competence} \perp\!\!\!\perp \text{Business-Unit.Revenue} \mid \text{Product.Success}$

To  see  why  this  approach  is  flawed,  we  must  consider  the  ground  graph.  A  necessary 
precondition  for  inference  is  to  apply  a  model  to  a  data  instantiation,  which  yields  a  ground 
graph  to  which  d-separation  can  be  applied.  For  a  Bayesian  network,  a  ground  graph 
consists  of  replicates  of  the  model  structure  for  each  data  instance.  In  contrast,  a  relational 
model  defines  a  template  for  how  dependencies  apply  to  a  data  instantiation,  resulting  in  a 
ground graph with varying structure. See Section 4 for more details on ground graphs.
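To make the template idea concrete, here is a minimal sketch that instantiates the example model's five dependencies (listed with Figure 2.2(a)) on the skeleton fragment described for Figure 2.1(b). The function names and the tuple encoding of the skeleton are assumptions made for illustration, and only Roger, Sally, Laptop, Tablet, and Devices are included.

```python
# Skeleton fragment from Figure 2.1(b), encoded as (hypothetical) link sets.
develops = {("Roger", "Laptop"), ("Sally", "Laptop"), ("Sally", "Tablet")}
funds = {("Devices", "Laptop"), ("Devices", "Tablet")}  # (business unit, product)


def ground_graph(develops, funds):
    """Return directed edges (parent, child) between ground attribute variables."""
    edges = set()
    employees = {e for e, _ in develops}
    units = {b for b, _ in funds}
    for e, p in develops:  # [PRODUCT, DEVELOPS, EMPLOYEE].Competence -> [PRODUCT].Success
        edges.add((f"{e}.Competence", f"{p}.Success"))
    for b, p in funds:     # [BUSINESS-UNIT, FUNDS, PRODUCT].Success -> [BUSINESS-UNIT].Revenue
        edges.add((f"{p}.Success", f"{b}.Revenue"))
    for e in employees:    # [EMPLOYEE].Competence -> [EMPLOYEE].Salary
        edges.add((f"{e}.Competence", f"{e}.Salary"))
    for b in units:        # [BUSINESS-UNIT].Revenue -> [BUSINESS-UNIT].Budget
        edges.add((f"{b}.Revenue", f"{b}.Budget"))
    for e, p in develops:  # Budget of units funding an employee's products -> Salary
        for b, q in funds:
            if p == q:
                edges.add((f"{b}.Budget", f"{e}.Salary"))
    return edges


def parents_of(edges, node):
    """Ground parents of one variable."""
    return {u for u, v in edges if v == node}


edges = ground_graph(develops, funds)
print(sorted(parents_of(edges, "Laptop.Success")))  # two developers
print(sorted(parents_of(edges, "Tablet.Success")))  # one developer
```

The same template yields a different number of parents for Laptop.Success and Tablet.Success, which is the sense in which the ground graph has varying structure.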
Figure 2.2(b) shows the ground graph for the relational model in Figure 2.2(a) applied
to the relational skeleton in Figure 2.1(b). This ground graph illustrates that, for a
single employee, simply conditioning on the success of developed products can activate a
path through the competence of other employees who develop the same products; we call
this a relationally d-connecting path.¹ Checking d-separation on the ground graph indicates
that to d-separate an employee's competence from the revenue of funding business
units, we should not only condition on the success of developed products, but also on the
competence of other employees who work on those products (e.g.,
$\text{Roger.Competence} \perp\!\!\!\perp \text{Devices.Revenue} \mid \{\text{Laptop.Success}, \text{Sally.Competence}\}$).

[Figure 2.2(a): relational model drawn over the schema of Figure 2.1(a), with the following
dependencies.]

[EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT].Budget → [EMPLOYEE].Salary

[EMPLOYEE].Competence → [EMPLOYEE].Salary

[PRODUCT, DEVELOPS, EMPLOYEE].Competence → [PRODUCT].Success

[BUSINESS-UNIT, FUNDS, PRODUCT].Success → [BUSINESS-UNIT].Revenue

[BUSINESS-UNIT].Revenue → [BUSINESS-UNIT].Budget

(a) Example relational model. Competence of employees causes the success of products they develop, which
in turn influences the revenue received by the business unit funding the product. Additional dependencies
involve the budget of business units and employee salaries. The dependencies are specified by relational
paths, listed below the graphical model.

(b) Example fragment of a ground graph. The success of product Laptop is influenced by the competence
of both Roger and Sally. The revenue of business unit Devices is caused by the success of all its funded
products: Laptop, Tablet, and Smartphone.

Figure 2.2: An example relational model and ground graph for the organization domain.
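The two ground-graph judgments above can be checked mechanically. The sketch below hard-codes the edges among Roger, Sally, Laptop, Tablet, and Devices (an assumed fragment; the Smartphone product and the remaining employees are omitted) and tests d-separation with the standard moralized-ancestral-graph criterion, which is equivalent to d-separation on a DAG. The function names are illustrative only.

```python
from itertools import combinations

# Assumed ground-graph fragment for Figure 2.2(b), as (parent, child) pairs.
EDGES = [
    ("Roger.Competence", "Laptop.Success"),
    ("Sally.Competence", "Laptop.Success"),
    ("Sally.Competence", "Tablet.Success"),
    ("Laptop.Success", "Devices.Revenue"),
    ("Tablet.Success", "Devices.Revenue"),
    ("Roger.Competence", "Roger.Salary"),
    ("Sally.Competence", "Sally.Salary"),
    ("Devices.Revenue", "Devices.Budget"),
    ("Devices.Budget", "Roger.Salary"),
    ("Devices.Budget", "Sally.Salary"),
]


def ancestral_set(edges, nodes):
    """The given nodes together with all of their ancestors."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, x, y, given):
    """True if x and y are d-separated by `given` in the DAG `edges`."""
    keep = ancestral_set(edges, {x, y} | set(given))
    neighbors = {n: set() for n in keep}
    parents = {}
    for u, v in edges:
        if u in keep and v in keep:
            neighbors[u].add(v)
            neighbors[v].add(u)
            parents.setdefault(v, set()).add(u)
    for ps in parents.values():          # moralize: connect co-parents
        for a, b in combinations(ps, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, stack = {x}, [x]               # search that never enters `given`
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in neighbors[n] - seen - set(given):
            seen.add(m)
            stack.append(m)
    return True


naive = d_separated(EDGES, "Roger.Competence", "Devices.Revenue", {"Laptop.Success"})
fixed = d_separated(EDGES, "Roger.Competence", "Devices.Revenue",
                    {"Laptop.Success", "Sally.Competence"})
print(naive, fixed)
```

On this fragment the first query is not d-separated, because conditioning on Laptop.Success opens the path through Sally.Competence and Tablet.Success, while the second query is.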

This example also demonstrates that the Markov condition can be violated when directly
applied to the structure of a relational model. In this case, the Markov condition according
to the model structure in Figure 2.2(a) implies that
$P(\text{Competence}, \text{Revenue} \mid \text{Success}) = P(\text{Competence} \mid \text{Success})\,P(\text{Revenue} \mid \text{Success})$,
that is, that revenue is independent of its non-descendants (competence) given its parents
(success). However, the ground graph shows the opposite, for example,
$P(\text{Roger.Competence}, \text{Devices.Revenue} \mid \text{Laptop.Success}) \neq P(\text{Roger.Competence} \mid \text{Laptop.Success})\,P(\text{Devices.Revenue} \mid \text{Laptop.Success})$.
In fact, for this model, d-separation produces many other incorrect judgments of conditional
independence. Through simulation, we found that only 25% of the pairs of variables can even be
described by direct inspection of this model structure, and of those (such as the above
example), 75% yield an incorrect conclusion. This is a single data point of a larger empirical
evaluation presented in Section 6. Those results provide quantitative details of how often
to expect traditional d-separation to fail when applied to the structure of relational models.

1. The indirect effect attributed to a relationally d-connecting path is often referred to as interference,
a spillover effect, or a violation of the stable unit treatment value assumption (SUTVA) because the
treatment of one instance (employee competence) affects the outcome of another (the revenue of another
employee's business unit).


3.  Semantics  and  Alternatives 


The example in Section 2 provides a useful basis to describe the semantics imposed by
relational d-separation and the characteristics of our approach. There are two primary
concepts:

(1) All-ground-graphs semantics: It might appear that, since the standard rules of
d-separation apply to Bayesian networks and the ground graphs of relational models are also
Bayesian networks, applying d-separation to relational models is a non-issue. However,
applying d-separation to a single ground graph may result in potentially unbounded runtime
if the instantiation is large (relational databases can be arbitrarily large). Further,
and more importantly, the semantics of d-separation require that conditional independencies
hold across all possible model instantiations. Although d-separation can apply directly to
a ground graph, these semantics prohibit reasoning about a single ground graph.

The  conditional  independence  facts  derived  from  d-separation  hold  for  all  distributions 
represented  by  a  Bayesian  network.  Analogously,  the  implications  of  relational  d-separation 
should  hold  for  all  distributions  represented  by  a  relational  model.  It  is  simple  to  show 
that  these  implications  hold  for  all  ground  graphs  of  a  Bayesian  network — every  ground 
graph  consists  of  a  set  of  disconnected  subgraphs,  each  of  which  has  a  structure  that  is 
identical  to  that  of  the  model.  However,  the  set  of  distributions  represented  by  a  relational 
model  depends  on  both  the  relational  skeleton  (constrained  by  the  schema)  and  the  model 
parameters.  That  is,  the  ground  graphs  of  relational  models  vary  with  the  structure  of  the 
underlying  relational  skeleton  (e.g.,  different  products  are  developed  by  varying  numbers 
of  employees).  As  a  result,  answering  relational  d-separation  queries  requires  reasoning 
without  respect  to  ground  graphs. 

(2) Perspective-based analysis: Relational models make explicit one implicit choice
underlying nearly any form of data analysis. This choice, which we refer to here as a
perspective, concerns the selection of a particular unit or subject of analysis. For example, in the
social sciences, a commonly used acronym is UTOS, for framing an analysis by choosing a
unit, treatment, outcome, and setting. Any method, such as Bayesian network modeling,
that assumes IID data makes the implicit assumption that the attributes on data instances
correspond to attributes of a single unit or perspective. In the example, we targeted a
specific conditional independence regarding employee instances (as opposed to products or
business units).

The concept of perspectives is not new, but it is central to statistical relational learning
because relational data sets may be heterogeneous, involving instances that refer to
multiple, distinct perspectives. The inductive logic programming (ILP) community has
discussed individual-centered representations (Flach, 1999), and many approaches to
propositionalizing relational data have been developed to enforce a single perspective in order
to rely on existing propositional learning algorithms (Kramer et al., 2001). An alternative
strategy is to explicitly acknowledge the presence of multiple perspectives and learn jointly
among them. This approach underlies many algorithms that learn the types of probabilistic
models of relational data applicable in this work, e.g., learning the structure of probabilistic
relational models, relational dependency networks, or parametrized Bayesian networks
(Friedman et al., 1999; Neville and Jensen, 2007; Schulte et al., 2012).


Often,  data  sets  are  derivative,  leading  to  little  or  no  choice  about  which  perspectives 
to  analyze.  However,  for  relational  domains,  from  which  these  data  sets  are  derived,  it 
is  assumed  that  there  are  multiple  perspectives,  and  we  can  dynamically  analyze  different 
perspectives.  In  the  example,  we  chose  the  employee  perspective,  and  the  analysis  focused 
on  the  dependence  between  an  employee’s  competence  and  the  revenue  of  business  units 
that  fund  developed  products.  However,  if  the  question  were  posed  from  the  perspective 
of  business  units,  then  we  could  conceivably  condition  on  the  success  of  products  for  each 
business unit. In this scenario, reasoning about d-separation at the model level would lead to
a correct conditional independence statement. Some (though fairly infrequent) d-separation
queries produce accurate conditional independence facts when applied to relational model
structure (see Section 6). However, the model is often unknown, a perspective may be chosen
a priori, and a theory that is occasionally correct is clearly undesirable. Additionally, to
support  constraint-based  learning  algorithms,  it  is  important  to  reason  about  conditional 
independence  implications  from  different  perspectives. 

One plausible alternative approach would be to answer d-separation queries by ignoring
perspectives and considering just the attribute classes (i.e., reason about Competence and
Revenue given Success). However, it remains to define explicit semantics for grounding and
evaluating the query based on the relational skeleton. There are at least three options:


• Construct three sets of variables, including all instances of competence, revenue, and
success variables: Although the ground graph has the semantics of a Bayesian network,
there is only a single ground graph, that is, one data sample (Xiang and Neville, 2011).
Consequently, this analysis would be statistically meaningless and is the primary
reason why relational learning algorithms dynamically generate propositional data
for each instance of a given perspective.


• Test the Cartesian product of competence and revenue variables, conditioned on all
success variables: Testing all pairs invariably leads to independence. Moreover, these
semantics are incoherent; only reachable pairs of variables should be compared. For
propositional  data,  variable  pairs  are  constructed  by  choosing  attribute  values,  e.g., 
height  and  weight,  within  an  individual.  The  same  is  true  for  relational  data:  Only 
choose  the  success  of  products  for  employees  that  actually  develop  them,  following 
the  underlying  relational  connections. 


•  Test  relationally  connected  pairs  of  competence  and  revenue  variables,  conditioned  on 
all  success  variables:  Again,  this  appears  plausible  based  on  traditional  d-separation; 
every  instance  in  the  table  conditions  on  the  same  set  of  success  values.  Therefore, 
this  is  akin  to  not  conditioning  because  the  conditioning  variable  is  a  constant. 


We  argue  that  the  desired  semantics  are  essentially  the  explicit  semantics  of  perspective- 
based  queries.  Therefore,  we  advocate  perspective-based  analysis  as  the  only  statistically 
and  semantically  meaningful  approach  for  relational  data  and  models. 
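As a small sketch of what a perspective-based query looks like in practice, the code below builds one row per employee by following the relational connections of the skeleton fragment in Figure 2.1(b). The dictionary encoding and the function name are hypothetical; only the links among Roger, Sally, Laptop, Tablet, and Devices come from the example.

```python
# Skeleton fragment (assumed encoding): who develops what, and who funds what.
develops = {"Roger": ["Laptop"], "Sally": ["Laptop", "Tablet"]}
funded_by = {"Laptop": "Devices", "Tablet": "Devices"}


def employee_perspective_rows(develops, funded_by):
    """One row per employee: the variables reachable from that employee."""
    rows = []
    for employee, products in sorted(develops.items()):
        units = sorted({funded_by[p] for p in products})
        rows.append({
            "unit": employee,
            "competence": f"{employee}.Competence",
            # Only products this employee actually develops.
            "success": [f"{p}.Success" for p in products],
            # Only business units that fund those products.
            "revenue": [f"{b}.Revenue" for b in units],
        })
    return rows


rows = employee_perspective_rows(develops, funded_by)
for row in rows:
    print(row)
```

Each row conditions on its own set of success variables, in contrast to the third option above, where every row would condition on the same constant set.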

Our approach to answering relational d-separation queries incorporates the two
aforementioned semantics. In Section 5, we describe a new, lifted representation, the abstract
ground graph, that is provably sound and complete in its abstraction of all ground graphs
for a given relational model. As their name suggests, abstract ground graphs abstract all
ground graphs of a relational model, representing any potential relationally d-connecting
path (recall the example d-connecting path that only manifests in the ground graph). A
relational model has a corresponding set of abstract ground graphs, one for each perspective
(i.e., entity or relationship class in its underlying schema), which can be used to reason about
relational d-separation with respect to any given perspective. Figure 3.1 shows a fragment
of an abstract ground graph from the employee perspective for the model in Figure 2.2(a).
The nodes are depicted with their intuitive meaning rather than their actual syntax for this
example. Representational details and accompanying theory are presented in Section 5.

[Figure 3.1 node labels: Employee's competence; Success of products developed by an
employee; Revenue of business units that fund products developed by an employee;
Co-workers' competence; Success of other products developed by co-workers; Success of other
products funded by business units that fund products developed by an employee.]

Figure 3.1: Example abstract ground graph from the perspective of employees. Nodes are
labeled with their intuitive meaning.


4.  Concepts  of  Relational  Data  and  Models 


Propositional representations describe domains with a single entity type, but many
real-world systems involve multiple types of interacting entities with probabilistic dependencies
among their variables. For example, in the model in Figure 2.2(a), the competence of
employees affects the success of products they develop. Many researchers have focused on
modeling such domains, which are generally characterized as relational. These relational
representations can be divided into two main categories. The first is probabilistic graphical
models, such as probabilistic relational models (PRMs) (Koller and Pfeffer, 1998), directed acyclic
probabilistic entity-relationship (DAPER) models (Heckerman et al., 2004), and relational
Markov networks (RMNs) (Taskar et al., 2002). The second is probabilistic logic models, such as
Bayesian logic programs (BLPs) (Kersting and De Raedt, 2002), Markov logic networks
(MLNs) (Richardson and Domingos, 2006), parametrized Bayesian networks (PBNs) (Poole,
2003), Bayesian logic (BLOG) (Milch et al., 2005), multi-entity Bayesian networks (MEBNs)
(Laskey, 2008), and relational probability models (RPMs) (Russell and Norvig, 2010).


To  facilitate  an  extension  to  the  graphical  criterion  of  d-separation,  we  currently  focus 
on  directed,  acyclic,  graphical  models  of  conditional  independence.  As  most  of  the  above 
models  have  similar  expressive  power,  the  results  in  this  paper  could  generalize  across 
representations — even  for  undirected  relational  models,  such  as  RMNs  and  MLNs,  after 
moralization. However, we found it simpler to define and prove relevant theoretical properties
for relational d-separation in a representation most similar to Bayesian networks. In
this section, we formally define the concepts of relational data and models using a
representation similar to that of PRMs and DAPER models.

A  relational  schema  is  a  top-level  description  of  what  data  exist  in  a  particular  domain. 
Specifically  (adapted  from  Heckerman  et  al.,  2007): 


Definition 4.1 (Relational schema) A relational schema $\mathcal{S} = (\mathcal{E}, \mathcal{R}, \mathcal{A}, card)$ consists
of a set of entity classes $\mathcal{E} = \{E_1, \ldots, E_m\}$; a set of relationship classes $\mathcal{R} = \{R_1, \ldots, R_n\}$,
where each $R_i = (E^i_1, \ldots, E^i_{a_i})$, with $E^i_j \in \mathcal{E}$ and $a_i$ the arity of $R_i$; a set of attribute
classes $\mathcal{A}(I)$ for each item class $I \in \mathcal{E} \cup \mathcal{R}$; and a cardinality function
$card : \mathcal{R} \times \mathcal{E} \to \{\text{ONE}, \text{MANY}\}$.


A  relational  schema  can  be  represented  graphically  with  an  entity-relationship  (ER) 
diagram.  We  adopt  a  slightly  modified  ER  diagram  using  Barker’s  notation  (1990),  where 
entity  classes  are  rectangular  boxes,  relationship  classes  are  diamonds  with  dashed  lines 
connecting  their  associated  entity  classes,  attribute  classes  are  ovals  residing  on  entity  and 
relationship  classes,  and  cardinalities  are  represented  with  crow’s  foot  notation. 


Example 4.1 The relational schema $\mathcal{S}$ for the organization domain example depicted in
Figure 2.1(a) consists of entities $\mathcal{E}$ = {Employee, Product, Business-Unit}; relationships
$\mathcal{R}$ = {Develops, Funds}, where Develops = (Employee, Product) and Funds
= (Business-Unit, Product), with cardinalities card(Develops, Employee) =
MANY, card(Develops, Product) = MANY, card(Funds, Business-Unit) = MANY, and
card(Funds, Product) = ONE; and attributes $\mathcal{A}$(Employee) = {Competence, Salary},
$\mathcal{A}$(Product) = {Success}, and $\mathcal{A}$(Business-Unit) = {Budget, Revenue}. □
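
As a rough illustration of Definition 4.1, the schema of Example 4.1 can be written down as plain data. The sketch below is one possible encoding in Python; the dictionary layout and the helper name `check_schema` are assumptions made for illustration and are not notation from the paper.

```python
ONE, MANY = "ONE", "MANY"

# Schema of Example 4.1 as plain data (the layout is an illustrative assumption).
SCHEMA = {
    "entities": {"Employee", "Product", "Business-Unit"},
    "relationships": {
        "Develops": ("Employee", "Product"),
        "Funds": ("Business-Unit", "Product"),
    },
    "attributes": {
        "Employee": {"Competence", "Salary"},
        "Product": {"Success"},
        "Business-Unit": {"Budget", "Revenue"},
    },
    "card": {
        ("Develops", "Employee"): MANY,
        ("Develops", "Product"): MANY,
        ("Funds", "Business-Unit"): MANY,
        ("Funds", "Product"): ONE,
    },
}


def check_schema(schema):
    """Return a list of violations of Definition 4.1 (empty if there are none)."""
    problems = []
    item_classes = schema["entities"] | set(schema["relationships"])
    for rel, components in schema["relationships"].items():
        for ent in components:
            # Every component of a relationship class must be an entity class.
            if ent not in schema["entities"]:
                problems.append(f"{rel}: unknown entity class {ent}")
            # card must be defined on (R, E) and take the value ONE or MANY.
            if schema["card"].get((rel, ent)) not in (ONE, MANY):
                problems.append(f"card({rel}, {ent}) is missing or invalid")
    for item in schema["attributes"]:
        # Attribute classes are only defined for item classes of the schema.
        if item not in item_classes:
            problems.append(f"attributes given for unknown item class {item}")
    return problems


print(check_schema(SCHEMA))  # prints [] for the schema of Example 4.1
```

Keeping `card` keyed by (relationship, entity) pairs mirrors the signature of the cardinality function in the definition.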


A relational schema is a template for a relational skeleton (also referred to as a data
graph by Neville and Jensen, 2007), an instantiation of entity and relationship classes.
Specifically (adapted from Heckerman et al., 2007):


Definition 4.2 (Relational skeleton) A relational skeleton $\sigma$ for relational schema $\mathcal{S} =
(\mathcal{E}, \mathcal{R}, \mathcal{A}, card)$ specifies a set of entity instances $\sigma(E)$ for each $E \in \mathcal{E}$ and relationship
instances $\sigma(R)$ for each $R \in \mathcal{R}$. Relationship instances adhere to the cardinality constraints
of $\mathcal{S}$: If $card(R, E) = \text{ONE}$, then for each $e \in \sigma(E)$ there is at most one $r \in \sigma(R)$ such that
$e$ participates in $r$.


For convenience, we use the notation $E \in R$ if entity class $E$ is a component of relationship
class $R$, and, similarly, $e \in r$ if entity instance $e$ is a component of the relationship
instance $r$. We also denote the set of all skeletons for schema $\mathcal{S}$ as $\Sigma_{\mathcal{S}}$.


Example 4.2 The relational skeleton $\sigma$ for the organization example is depicted in
Figure 2.1(b). The sets of entity instances are $\sigma$(Employee) = {Paul, Quinn, Roger, Sally,
Thomas}, $\sigma$(Product) = {Case, Adapter, Laptop, Tablet, Smartphone}, and $\sigma$(Business-Unit)
= {Accessories, Devices}. The sets of relationship instances are $\sigma$(Develops) =
{(Paul, Case), (Quinn, Case), . . . , (Thomas, Smartphone)} and $\sigma$(Funds) = {(Accessories,
Case), (Accessories, Adapter), . . . , (Devices, Smartphone)}. The relationship instances adhere
to their cardinality constraints (e.g., Funds is a ONE-to-MANY relationship: within
$\sigma$(Funds), every product has a single business unit, and every business unit may have
multiple products). □
In order to specify a model over a relational domain, we must define a space of possible
variables and dependencies. Consider the example dependency [Product, Develops,
Employee].Competence → [Product].Success from the model in Figure 2.2(a), expressing
that the competence of employees developing a product affects the success of that product.
For relational data, the variable space includes not only intrinsic entity and relationship
attributes (e.g., success of a product), but also the attributes on other entity and relationship
classes that are reachable by paths along the relational schema (e.g., the competence
of employees that develop a product). We define relational paths to formalize the notion of
which item classes are reachable on the schema from a given item class.²


Definition 4.3 (Relational path) A relational path $[I_j, \ldots, I_k]$ for relational schema $\mathcal{S}$
is an alternating sequence of entity and relationship classes $I_j, \ldots, I_k \in \mathcal{E} \cup \mathcal{R}$ such that:

(1) For every pair of consecutive item classes $[E, R]$ or $[R, E]$ in the path, $E \in R$.

(2) For every triple of consecutive item classes $[E, R, E']$, $E \neq E'$.³

(3) For every triple of consecutive item classes $[R, E, R']$, if $R = R'$, then $card(R, E) =
\text{MANY}$.

$I_j$ is called the base item, or perspective, of the relational path.


Condition (1) enforces that entity classes participate in adjacent relationship classes in
the path. Conditions (2) and (3) remove any paths that would invariably reach an empty
terminal set (see Definition 4.4 and the accompanying appendix). This definition of relational
paths is similar to “meta-paths” and “relevance paths” in similarity search and information
retrieval in heterogeneous networks (Sun et al., 2011; Shi et al., 2012). Relational paths also
extend the notion of “slot chains” from the PRM framework (Getoor et al., 2007) by including
cardinality constraints and formally describing the semantics under which repeated item
classes may appear on a path. Relational paths are also a specialization of the first-order
constraints on arc classes imposed on DAPER models (Heckerman et al., 2007).


Example 4.3 Consider the example relational schema in Figure 2.1(a). Some example
relational paths from the Employee perspective (with an intuitive meaning of what the
paths describe) include the following: [Employee] (an employee), [Employee, Develops,
Product] (products developed by an employee), [Employee, Develops, Product,
Funds, Business-Unit] (business units of the products developed by an employee), and
[Employee, Develops, Product, Develops, Employee] (co-workers developing the
same products). Invalid relational paths include [Employee, Develops, Employee]
(because Employee = Employee and Develops $\in \mathcal{R}$) and [Business-Unit, Funds,
Product, Funds, Business-Unit] (because Product $\in \mathcal{E}$ and card(Funds, Product) =
ONE). □
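
The three conditions of Definition 4.3 translate directly into a validity check. The following is a minimal sketch, assuming the schema of Example 4.1 is stored in module-level dictionaries; the names `ENTITIES`, `RELATIONSHIPS`, `CARD`, and `is_valid_relational_path` are hypothetical and chosen only for this illustration.

```python
ONE, MANY = "ONE", "MANY"

# Schema of Example 4.1 (encoding is an illustrative assumption).
ENTITIES = {"Employee", "Product", "Business-Unit"}
RELATIONSHIPS = {
    "Develops": ("Employee", "Product"),
    "Funds": ("Business-Unit", "Product"),
}
CARD = {
    ("Develops", "Employee"): MANY,
    ("Develops", "Product"): MANY,
    ("Funds", "Business-Unit"): MANY,
    ("Funds", "Product"): ONE,
}


def is_valid_relational_path(path):
    """Check conditions (1)-(3) of Definition 4.3 for a list of item class names."""
    if not path or any(c not in ENTITIES and c not in RELATIONSHIPS for c in path):
        return False
    # Condition (1): consecutive classes alternate, and the entity class
    # is a component of the adjacent relationship class.
    for a, b in zip(path, path[1:]):
        rel, ent = (a, b) if a in RELATIONSHIPS else (b, a)
        if rel not in RELATIONSHIPS or ent not in ENTITIES:
            return False
        if ent not in RELATIONSHIPS[rel]:
            return False
    for first, middle, last in zip(path, path[1:], path[2:]):
        # Condition (2): in a triple [E, R, E'], E must differ from E'.
        if middle in RELATIONSHIPS and first == last:
            return False
        # Condition (3): in a triple [R, E, R'] with R = R', card(R, E) must be MANY.
        if middle in ENTITIES and first == last and CARD[(first, middle)] != MANY:
            return False
    return True


for p in (
    ["Employee", "Develops", "Product", "Develops", "Employee"],
    ["Employee", "Develops", "Employee"],
    ["Business-Unit", "Funds", "Product", "Funds", "Business-Unit"],
):
    print(p, is_valid_relational_path(p))
```

Under these assumptions, the first path is accepted and the other two are rejected, matching the valid and invalid paths listed in Example 4.3.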


Relational paths are defined at the level of relational schemas, and as such are templates
for paths in a relational skeleton. An instantiated relational path produces a set of traversals
on a relational skeleton. However, the quantity of interest is not the traversals, but the set
of reachable item instances (i.e., entity or relationship instances). These reachable instances
are the fundamental elements that support model instantiations (i.e., ground graphs).

2. Because the term “path” is also commonly used to describe chains of dependencies in graphical models,
we will explicitly qualify each reference to avoid ambiguity.

3. This condition suggests at first glance that self-relationships (e.g., employees manage other employees,
individuals in social networks maintain friendships, scholarly articles cite other articles) are prohibited.
We discuss this and other model assumptions in Section 8.

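Definition 4.4 (Terminal set) For skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and $i_j \in \sigma(I_j)$, the terminal set $P|_{i_j}$
for relational path $P = [I_j, \ldots, I_k]$ of length $n$ is defined inductively as

$$P^1|_{i_j} = [I_j]|_{i_j} = \{i_j\}$$

$$\vdots$$

$$P^n|_{i_j} = [I_j, \ldots, I_k]|_{i_j} = \bigcup_{i_m \in P^{n-1}|_{i_j}} \Big\{ i_k \;\Big|\; \big((i_m \in i_k \text{ if } I_k \in \mathcal{R}) \vee (i_k \in i_m \text{ if } I_k \in \mathcal{E})\big) \wedge i_k \notin \bigcup_{l=1}^{n-1} P^l|_{i_j} \Big\}$$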

A terminal set of a relational path $P = [I_j, \ldots, I_k]$ consists of instances of class $I_k$,
the terminal item on the path. Conceptually, a terminal set is produced by traversing a
skeleton beginning at a single instance of the base item class, $i_j \in \sigma(I_j)$, following instances
of the item classes in the relational path, and reaching a set of instances of class $I_k$. The
term $i_k \notin \bigcup_{l=1}^{n-1} P^l|_{i_j}$ in the definition implies a “bridge burning” semantics under which
no item instances are revisited ($i_k$ does not appear in the terminal set of any prefix of
$P$).⁴ The notion of terminal sets is a necessary concept for grounding any relational model
and has been described in previous work (e.g., for PRMs by Getoor et al., 2007, and for MLNs
by Richardson and Domingos, 2006), but it has not been explicitly named. We emphasize their
importance because terminal sets are also critical for defining relational d-separation, and
we formalize the semantics for bridge burning.


Example 4.4 We can generate terminal sets by pairing the set of relational paths for
the schema in Figure 2.1(a) with the relational skeleton in Figure 2.1(b). Let Quinn be
our base item instance. Then [Employee]$|_{\text{Quinn}}$ = {Quinn}, [Employee, Develops,
Product]$|_{\text{Quinn}}$ = {Case, Adapter, Laptop}, [Employee, Develops, Product, Funds,
Business-Unit]$|_{\text{Quinn}}$ = {Accessories, Devices}, and [Employee, Develops, Product,
Develops, Employee]$|_{\text{Quinn}}$ = {Paul, Roger, Sally}. The bridge burning semantics
enforce that Quinn is not also included in this last terminal set. □
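
The inductive definition of terminal sets can be sketched as a breadth-first traversal that remembers every instance reached by a prefix of the path. The code below is an illustration under stated assumptions: relationship instances are encoded as tuples of entity instance names, and because the text elides most of the Develops and Funds pairs, the pairs listed in `SKELETON` are assumed values chosen to reproduce the terminal sets of Example 4.4 rather than data taken from the figure. The function name `terminal_set` is likewise hypothetical.

```python
# Assumed skeleton: entity instances are strings, relationship instances are
# tuples of entity instances. The relationship pairs are illustrative guesses.
SKELETON = {
    "Employee": {"Paul", "Quinn", "Roger", "Sally", "Thomas"},
    "Product": {"Case", "Adapter", "Laptop", "Tablet", "Smartphone"},
    "Business-Unit": {"Accessories", "Devices"},
    "Develops": {
        ("Paul", "Case"), ("Quinn", "Case"), ("Quinn", "Adapter"),
        ("Quinn", "Laptop"), ("Roger", "Laptop"), ("Sally", "Laptop"),
        ("Sally", "Tablet"), ("Thomas", "Tablet"), ("Thomas", "Smartphone"),
    },
    "Funds": {
        ("Accessories", "Case"), ("Accessories", "Adapter"),
        ("Devices", "Laptop"), ("Devices", "Tablet"), ("Devices", "Smartphone"),
    },
}


def terminal_set(path, base, skeleton):
    """Terminal set of a valid relational path for one base item instance."""
    current = {base}   # terminal set of the current prefix
    visited = {base}   # union of the terminal sets of all prefixes so far
    for item_class in path[1:]:
        reached = set()
        for i_m in current:
            for i_k in skeleton[item_class]:
                if isinstance(i_k, tuple):
                    linked = i_m in i_k   # I_k is a relationship class: i_m in i_k
                else:
                    linked = i_k in i_m   # I_k is an entity class: i_k in i_m
                # Bridge burning: skip instances reached by any prefix.
                if linked and i_k not in visited:
                    reached.add(i_k)
        visited |= reached
        current = reached
    return current


coworkers = ["Employee", "Develops", "Product", "Develops", "Employee"]
print(sorted(terminal_set(["Employee", "Develops", "Product"], "Quinn", SKELETON)))
# ['Adapter', 'Case', 'Laptop']
print(sorted(terminal_set(coworkers, "Quinn", SKELETON)))
# ['Paul', 'Roger', 'Sally']
```

Tracking `visited` across all prefixes, rather than only the previous step, is what excludes Quinn from the co-worker set.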


For a given base item class, it is common (depending on the schema) for distinct relational
paths to reach the same terminal item class. The following lemma states that if two
relational  paths  with  the  same  base  item  and  the  same  terminal  item  differ  at  some  point  in 
the  path,  then  for  some  relational  skeleton  and  some  base  item  instance,  their  terminal  sets 
will  have  a  non-empty  intersection.  This  property  is  important  to  consider  for  relational 
d-separation. 


Lemma 4.1 For two relational paths of arbitrary length from $I_j$ to $I_k$ that differ in at least
one item class, $P_1 = [I_j, \ldots, I_m, \ldots, I_k]$ and $P_2 = [I_j, \ldots, I_n, \ldots, I_k]$ with $I_m \neq I_n$, there
exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ such that $P_1|_{i_j} \cap P_2|_{i_j} \neq \emptyset$ for some $i_j \in \sigma(I_j)$.

Proof. See Appendix A.

4. The bridge burning semantics yield terminal sets that are necessarily subsets of terminal sets that would
otherwise be produced without bridge burning. Although this appears to be limiting, it actually enables
a strictly more expressive class of relational models. See Appendix B for more details and an example.

Example 4.5 Let $P_1$ = [Employee, Develops, Product, Develops, Employee, Develops,
Product], the terminal sets for which yield other products developed by collaborating
employees. Let $P_2$ = [Employee, Develops, Product, Funds, Business-Unit,
Funds, Product], the terminal sets for which consist of other products funded by the
business units funding products developed by a given employee. Intersection among terminal
sets for these paths occurs even in the small example skeleton. In fact, the intersection
of the terminal sets for $P_1$ and $P_2$ is non-empty for all employees. For example, Paul:
$P_1|_{\text{Paul}}$ = {Adapter, Laptop} and $P_2|_{\text{Paul}}$ = {Adapter}; Quinn: $P_1|_{\text{Quinn}}$ = {Tablet} and
$P_2|_{\text{Quinn}}$ = {Tablet, Smartphone}. □

Given  the  definition  for  relational  paths,  it  is  simple  to  define  relational  variables  and 
their  instances. 


Definition 4.5 (Relational variable) A relational variable $[I_j, \ldots, I_k].X$ consists of a
relational path $[I_j, \ldots, I_k]$ and an attribute class $X \in \mathcal{A}(I_k)$.


As with relational paths, we refer to $I_j$ as the perspective of the relational variable.
Relational  variables  are  templates  for  sets  of  random  variables  (see  Definition  4.6).  Sets  of 
relational  variables  are  the  basis  of  relational  d-separation  queries,  and  consequently  they 
are  also  the  nodes  of  the  abstract  representation  that  answers  those  queries.  There  is  an 
equivalent  formulation  in  the  PRM  framework,  although  not  explicitly  named  (they  are 
simply  denoted  as  attribute  classes  of  K-related  item  classes  via  slot  chain  K).  As  they  are 
critical  to  relational  d-separation,  we  provide  this  concept  with  an  explicit  designation. 


Example 4.6 Relational variables for the relational paths in Example 4.3 include intrinsic
attributes such as [Employee].Competence and [Employee].Salary, and also attributes
on related entity classes such as [Employee, Develops, Product].Success, [Employee,
Develops, Product, Funds, Business-Unit].Revenue, and [Employee, Develops,
Product, Develops, Employee].Salary. □

Definition 4.6 (Relational variable instance) For skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and $i_j \in \sigma(I_j)$, a
relational variable instance $[I_j, \ldots, I_k].X|_{i_j}$ for relational variable $[I_j, \ldots, I_k].X$ is the set of
random variables $\{i_k.X \mid X \in \mathcal{A}(I_k) \wedge i_k \in [I_j, \ldots, I_k]|_{i_j} \wedge i_k \in \sigma(I_k)\}$.

To instantiate a relational variable $[I_j, \ldots, I_k].X$ for a specific base item instance $i_j$,
we first find the terminal set of the underlying relational path $[I_j, \ldots, I_k]|_{i_j}$ and then take
the $X$ attributes of the $I_k$ item instances in that terminal set. This produces a set of
random variables $i_k.X$, which also correspond to nodes in the ground graph. As a notational
convenience, if $\mathbf{X}$ is a set of relational variables, all from a common perspective $I_j$, then we
say that $\mathbf{X}|_{i_j}$ for some item $i_j \in \sigma(I_j)$ is the union of all instantiations, $\{x \mid x \in X|_{i_j} \wedge X \in
\mathbf{X}\}$.

Example 4.7 Instantiating the relational variables from Example 4.6 with base item
instance Sally yields [Employee].Competence$|_{\text{Sally}}$ = {Sally.Competence}, [Employee,
Develops, Product].Success$|_{\text{Sally}}$ = {Laptop.Success, Tablet.Success}, [Employee,
Develops, Product, Funds, Business-Unit].Revenue$|_{\text{Sally}}$ = {Devices.Revenue}, and
[Employee, Develops, Product, Develops, Employee].Salary$|_{\text{Sally}}$ = {Quinn.Salary,
Thomas.Salary}. □

Given  the  definitions  for  relational  variables,  we  can  now  define  relational  dependencies. 


Definition 4.7 (Relational dependency) A relational dependency $[I_j, \ldots, I_k].Y \to
[I_j].X$ is a directed probabilistic dependence from attribute class $Y$ to $X$ through the
relational path $[I_j, \ldots, I_k]$.


Depending on the context, $[I_j, \ldots, I_k].Y$ and $[I_j].X$ can be referred to as treatment
and  outcome,  cause  and  effect,  or  parent  and  child.  A  relational  dependency  consists  of 
two  relational  variables  having  a  common  perspective.  The  relational  path  of  the  child  is 
restricted  to  a  single  item  class,  ensuring  that  the  terminal  sets  consist  of  a  single  value. 
This  is  consistent  with  PRMs,  except  that  we  explicitly  delineate  dependencies  rather  than 
define  parent  sets  of  relational  variables.  Note  that  relational  variables  are  not  nodes  in  a 
relational  model,  but  they  form  the  space  of  parent  variables  for  relational  dependencies. 
The  relational  path  specification  (before  the  attribute  class  of  the  parent)  is  equivalent  to 
a  slot  chain,  as  in  PRMs,  or  the  logical  constraint  on  a  dependency,  as  in  DAPER  models. 


Example 4.8 The dependencies in the relational model displayed in Figure 2.2(a) can
be specified as: [Product, Develops, Employee].Competence → [Product].Success
(product success is influenced by the competence of the employees developing the product),
[Employee].Competence → [Employee].Salary (an employee’s competence affects his or
her salary), [Business-Unit, Funds, Product].Success → [Business-Unit].Revenue
(the success of the products funded by a business unit influences that unit’s revenue),
[Employee, Develops, Product, Funds, Business-Unit].Budget → [Employee].Salary
(employee salary is governed by the budget of the business units for which they develop
products), and [Business-Unit].Revenue → [Business-Unit].Budget (the revenue of a
business unit influences its budget). □


We  now  have  sufficient  information  to  define  relational  models. 


Definition 4.8 (Relational model) A relational model $\mathcal{M}_\Theta$ consists of two parts:

1. The structure $\mathcal{M} = (\mathcal{S}, \mathcal{D})$: a schema $\mathcal{S}$ paired with a set of relational dependencies
$\mathcal{D}$ defined over $\mathcal{S}$.

2. The parameters $\Theta$: a conditional probability distribution $P([I_j].X \mid parents([I_j].X))$
for each relational variable of the form $[I_j].X$, where $I_j \in \mathcal{E} \cup \mathcal{R}$, $X \in \mathcal{A}(I_j)$, and
$parents([I_j].X) = \{[I_j, \ldots, I_k].Y \mid [I_j, \ldots, I_k].Y \to [I_j].X \in \mathcal{D}\}$ is the set of parent
relational variables.


The structure of a relational model can be represented graphically by superimposing
dependencies on the ER diagram of a relational schema (see Figure 2.2(a) for an example).
A relational dependency of the form $[I_j, \ldots, I_k].Y \to [I_j].X$ is depicted as a directed arrow
from attribute class $Y$ to $X$ with the specification listed separately. Note that the subset
of relational variables with singleton paths $[I].X$ in the definition corresponds to the set of
attribute classes in the schema.

A  common  technique  in  relational  learning  is  to  use  aggregation  functions  to  transform 
parent  multi-sets  to  single  values  within  the  conditional  probability  distributions.  Typically, 
aggregation  functions  are  simple,  such  as  mean  or  mode,  but  they  can  be  complex,  such  as 
those  based  on  vector  distance  or  object  identifiers,  as  in  the  ACORA  system  (Perlich  and 
Provost, 2006). Aggregates are a convenience for increasing power and accuracy
during learning, but they are not necessary for model specification.

This definition of relational models is consistent with and yields structures expressible
as DAPER models (Heckerman et al., 2007). These relational models are also equivalent to
PRMs, but we extend slot chains as relational paths and provide a formal semantics for their
instantiation. These models are also more general than plate models because dependencies
can be specified with arbitrary relational paths as opposed to simple intersections among
plates (Buntine, 1994; Gilks et al., 1994).


Just  as  the  relational  schema  is  a  template  for  skeletons,  the  structure  of  a  relational 
model  can  be  viewed  as  a  template  for  ground  graphs:  dependencies  applied  to  skeletons. 


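Definition 4.9 (Ground graph) The ground graph $GG_{\mathcal{M}\sigma} = (V, E)$ for relational model
structure $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ and skeleton $\sigma \in \Sigma_{\mathcal{S}}$ is a directed graph with nodes $V = \{i.X \mid I \in
\mathcal{E} \cup \mathcal{R} \wedge X \in \mathcal{A}(I) \wedge i \in \sigma(I)\}$ and edges $E = \{i_k.Y \to i_j.X \mid i_k.Y, i_j.X \in V \wedge i_k.Y \in
[I_j, \ldots, I_k].Y|_{i_j} \wedge [I_j, \ldots, I_k].Y \to [I_j].X \in \mathcal{D}\}$.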


A ground graph is a directed graph with (1) a node (random variable) for each attribute
of every entity and relationship instance in a skeleton and (2) an edge from $i_k.Y$ to $i_j.X$
if they belong to the parent and child relational variable instances, respectively, of some
dependency in the model. The concept of a ground graph appears for any type of relational
model, graphical or logic-based. For example, PRMs produce “ground Bayesian networks”
that are structurally equivalent to ground graphs, and Markov logic networks yield ground
Markov networks by applying all formulas to a set of constants (Richardson and Domingos,
2006). The example ground graph shown in Figure 2.2(b) is the result of applying the
dependencies in the relational model shown in Figure 2.2(a) to the skeleton in Figure 2.1(b).
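
Grounding a model amounts to instantiating every dependency for every base item instance. The sketch below illustrates this for the dependencies of Example 4.8; it is not the authors' implementation. The relationship pairs in `SKELETON` are assumed values (the text elides most of them), random variables are encoded as (instance, attribute) pairs, and the names `terminal_set` and `ground_graph_edges` are hypothetical.

```python
# Assumed skeleton; the relationship pairs are illustrative, not from the figure.
SKELETON = {
    "Employee": {"Paul", "Quinn", "Roger", "Sally", "Thomas"},
    "Product": {"Case", "Adapter", "Laptop", "Tablet", "Smartphone"},
    "Business-Unit": {"Accessories", "Devices"},
    "Develops": {
        ("Paul", "Case"), ("Quinn", "Case"), ("Quinn", "Adapter"),
        ("Quinn", "Laptop"), ("Roger", "Laptop"), ("Sally", "Laptop"),
        ("Sally", "Tablet"), ("Thomas", "Tablet"), ("Thomas", "Smartphone"),
    },
    "Funds": {
        ("Accessories", "Case"), ("Accessories", "Adapter"),
        ("Devices", "Laptop"), ("Devices", "Tablet"), ("Devices", "Smartphone"),
    },
}

# Each dependency [I_j, ..., I_k].Y -> [I_j].X is stored as (path, Y, X).
DEPENDENCIES = [
    (["Product", "Develops", "Employee"], "Competence", "Success"),
    (["Employee"], "Competence", "Salary"),
    (["Business-Unit", "Funds", "Product"], "Success", "Revenue"),
    (["Employee", "Develops", "Product", "Funds", "Business-Unit"], "Budget", "Salary"),
    (["Business-Unit"], "Revenue", "Budget"),
]


def terminal_set(path, base, skeleton):
    """Terminal set with bridge burning (relationship instances are tuples)."""
    current, visited = {base}, {base}
    for item_class in path[1:]:
        reached = set()
        for i_m in current:
            for i_k in skeleton[item_class]:
                linked = (i_m in i_k) if isinstance(i_k, tuple) else (i_k in i_m)
                if linked and i_k not in visited:
                    reached.add(i_k)
        visited |= reached
        current = reached
    return current


def ground_graph_edges(dependencies, skeleton):
    """Edges i_k.Y -> i_j.X for every dependency and every base instance i_j."""
    edges = set()
    for path, parent_attr, child_attr in dependencies:
        for i_j in skeleton[path[0]]:
            for i_k in terminal_set(path, i_j, skeleton):
                edges.add(((i_k, parent_attr), (i_j, child_attr)))
    return edges


EDGES = ground_graph_edges(DEPENDENCIES, SKELETON)
print((("Roger", "Competence"), ("Roger", "Salary")) in EDGES)  # True
```

The size of the edge set grows with the skeleton, which is one reason reasoning directly on ground graphs does not scale.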


Similar to Bayesian networks, given the parameters of a relational model, a parameterized
ground graph can express a joint distribution that factors as a product of the conditional
distributions:

$$P(GG_{\mathcal{M}_\Theta \sigma}) = \prod_{I \in \mathcal{E} \cup \mathcal{R}} \; \prod_{X \in \mathcal{A}(I)} \; \prod_{i \in \sigma(I)} P(i.X \mid parents(i.X))$$

where each $i.X$ is assigned the conditional distribution defined for $[I].X$ (a process referred
to as parameter-tying).

Relational models define coherent joint probability distributions only if they produce
acyclic ground graphs. A useful construct for checking model acyclicity is the class dependency
graph (Getoor et al., 2007), defined as:


Definition 4.10 (Class dependency graph) The class dependency graph $G_{\mathcal{M}} = (V, E)$
for relational model structure $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ is a directed graph with a node for each attribute
of every item class $V = \{I.X \mid I \in \mathcal{E} \cup \mathcal{R} \wedge X \in \mathcal{A}(I)\}$ and edges between pairs of attributes
supported by relational dependencies in the model $E = \{I_k.Y \to I_j.X \mid [I_j, \ldots, I_k].Y \to
[I_j].X \in \mathcal{D}\}$.


If the relational dependencies form an acyclic class dependency graph, then every possible
ground graph of that model is acyclic as well (Getoor et al., 2007). Given an acyclic
relational model, the ground graph has the same semantics as a Bayesian network (Getoor,
2001; Heckerman et al., 2007). All future references to acyclic relational models refer to
relational models whose structure forms acyclic class dependency graphs.
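
The acyclicity check on the class dependency graph can be sketched with a standard topological-sort test. The example below assumes dependencies are stored as (path, Y, X) triples, with the dependencies of Example 4.8 as input; the helper names `class_dependency_edges` and `is_acyclic` are hypothetical, and Kahn's algorithm is simply one convenient way to test for cycles.

```python
from collections import defaultdict, deque

# Dependencies of Example 4.8, each stored as (path, Y, X) for
# [I_j, ..., I_k].Y -> [I_j].X (the encoding is an illustrative assumption).
DEPENDENCIES = [
    (["Product", "Develops", "Employee"], "Competence", "Success"),
    (["Employee"], "Competence", "Salary"),
    (["Business-Unit", "Funds", "Product"], "Success", "Revenue"),
    (["Employee", "Develops", "Product", "Funds", "Business-Unit"], "Budget", "Salary"),
    (["Business-Unit"], "Revenue", "Budget"),
]


def class_dependency_edges(dependencies):
    """Each dependency contributes the edge I_k.Y -> I_j.X (Definition 4.10)."""
    return {((path[-1], y), (path[0], x)) for path, y, x in dependencies}


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        if v not in successors[u]:
            successors[u].add(v)
            indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(nodes)


print(is_acyclic(class_dependency_edges(DEPENDENCIES)))  # True

# Adding a hypothetical dependency from Salary back to Competence creates a cycle.
cyclic = DEPENDENCIES + [(["Employee"], "Salary", "Competence")]
print(is_acyclic(class_dependency_edges(cyclic)))  # False
```

Because the check runs on attribute classes rather than on instances, its cost does not depend on the size of any skeleton.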

By Lemma 4.1 and Definition 4.9, one relational dependency may imply dependence
between the instances of many relational variables. If there is an edge from $i_k.Y$ to $i_j.X$ in
the ground graph, then there is an implied dependency between all relational variables for
which $i_k.Y$ and $i_j.X$ are elements of their instances.


Example 4.9 The relational dependency [Employee].Competence → [Employee].Salary
yields the edge Roger.Competence → Roger.Salary in the ground graph of Figure 2.2(b)
because Roger.Competence $\in$ [Employee].Competence$|_{\text{Roger}}$. However, Roger.Competence
$\in$ [Employee, Develops, Product, Develops, Employee].Competence$|_{\text{Sally}}$ (as is
Roger.Salary, replacing Competence with Salary). Consequently, the relational dependency
implies dependence among the random variables in the instances of [Employee, Develops,
Product, Develops, Employee].Competence and [Employee, Develops, Product,
Develops, Employee].Salary. □


These  implied  dependencies  form  the  crux  of  the  challenge  of  identifying  independence  in 
relational  models.  Additionally,  the  intersection  between  the  terminal  sets  of  two  relational 
paths  is  crucial  for  reasoning  about  independence  because  a  random  variable  can  belong 
to  the  instances  of  more  than  one  relational  variable.  Since  d-separation  only  guarantees 
independence  when  there  are  no  d-connecting  paths,  we  must  consider  all  possible  paths 
between  pairs  of  random  variables,  either  of  which  may  be  a  member  of  multiple  relational 
variable instances. In Section 5, we define relational d-separation and provide an appropriate
representation, the abstract ground graph, that enables straightforward reasoning
about d-separation.


5.  Relational  d-Separation 

Conditional independence facts are correctly entailed by the rules of d-separation, but
only when applied to the graphical structure of Bayesian networks. Every ground graph
of a Bayesian network consists of a set of identical copies of the model structure (see the
appendix). Thus, the implications of d-separation on Bayesian networks hold for all instances
in every ground graph. In contrast, the structure of a relational model is a template for
ground graphs, and the structure of a ground graph varies with the underlying skeleton
(which is typically more complex than a set of disconnected instances). Conditional independence
facts are only useful when they hold across all ground graphs that are consistent
with the model, which leads to the following definition:
Definition 5.1 (Relational d-separation) Let $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ be three distinct sets of
relational variables with the same perspective $B \in \mathcal{E} \cup \mathcal{R}$ defined over relational schema $\mathcal{S}$.
Then, for relational model structure $\mathcal{M}$, $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{Z}$ if and only if, for
all skeletons $\sigma \in \Sigma_{\mathcal{S}}$, $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$ in ground graph $GG_{\mathcal{M}\sigma}$ for all
$b \in \sigma(B)$.


For any relational d-separation query, it is necessary that all relational variables in $\mathbf{X}$,
$\mathbf{Y}$, and $\mathbf{Z}$ have the same perspective (otherwise, the query would be incoherent).⁵ For $\mathbf{X}$ and
$\mathbf{Y}$ to be d-separated by $\mathbf{Z}$ in relational model structure $\mathcal{M}$, d-separation must hold for all
instantiations of those relational variables for all possible skeletons. This is a conservative
definition, but it is consistent with the semantics of d-separation on Bayesian networks: it
guarantees independence, but it does not guarantee dependence. If there exists even one
skeleton and faithful distribution represented by the relational model for which $\mathbf{X} \not\perp\!\!\!\perp \mathbf{Y} \mid \mathbf{Z}$,
then $\mathbf{X}$ and $\mathbf{Y}$ are not d-separated by $\mathbf{Z}$.

Given the semantics specified in Definition 5.1, answering relational d-separation queries
is  challenging  for  several  reasons: 

D-separation must hold over all ground graphs: The implications of d-separation on
Bayesian  networks  hold  for  all  possible  ground  graphs.  However,  the  ground  graphs  of 
a  Bayesian  network  consist  of  identical  copies  of  the  structure  of  the  model,  and  it  is 
sufficient  to  reason  about  d-separation  on  a  single  subgraph.  Although  it  is  possible  to 
verify  d-separation  on  a  single  ground  graph  of  a  relational  model,  the  conclusion  may  not 
generalize,  and  ground  graphs  can  be  arbitrarily  large. 

Relational models are templates: The structure of a relational model is a directed acyclic
graph, but the dependencies are actually templates for constructing ground graphs. The
rules of d-separation do not directly apply to relational models, only to their ground graphs.
Applying the rules of d-separation to a relational model frequently leads to incorrect conclusions
because of unrepresented d-connecting paths that are only manifest in ground
graphs.

Instances of relational variables may intersect: The instances of two different relational
variables may have non-empty intersections, as described by Lemma 4.1. These intersections
may be involved in relationally d-connecting paths, such as the example in Section 2. As
a result, a sound and complete approach to answering relational d-separation queries must
account for these paths.

Relational models may be specified from multiple perspectives: Relational models are defined
by relational dependencies, each specified from a single perspective. However, variables
in a ground graph may contribute to multiple relational variable instances, each defined from
a different perspective. Thus, reasoning about implied dependencies between arbitrary relational
variables, such as the one described in Example 4.9, requires a method to translate
dependencies across perspectives.


5.1  Abstracting  over  All  Ground  Graphs 

The definition of relational d-separation and its challenges suggest a solution that abstracts
over all possible ground graphs and explicitly represents the potential intersection between
pairs of relational variable instances. We introduce a new lifted representation, called the
abstract ground graph, that captures all dependencies among arbitrary relational variables
for all ground graphs, using knowledge of only the schema and the model. To represent all
dependencies, the construction of an abstract ground graph uses the extend method, which
maps a relational dependency to a set of implied dependencies for different perspectives.
Each abstract ground graph of a relational model is defined with respect to a given perspective
and can be used to reason about relational d-separation queries for that perspective.

5. This trivially holds for d-separation in Bayesian networks as all “propositional” variables have the same
implicit perspective.


Definition 5.2 (Abstract ground graph) An abstract ground graph $AGG_{\mathcal{M}B} = (V, E)$
for relational model structure $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$ is a directed graph
that abstracts the dependencies $\mathcal{D}$ for all ground graphs $GG_{\mathcal{M}\sigma}$, where $\sigma \in \Sigma_{\mathcal{S}}$.

The set of nodes in $AGG_{\mathcal{M}B}$ is $V = RV \cup IV$, where

•  $RV$ is the set of all relational variables of the form $[B, \ldots, I_j].X$

•  $IV$ is the set of all pairs of relational variables that could have non-empty intersections
(referred to as intersection variables):

$$\{RV_1 \cap RV_2 \mid RV_1, RV_2 \in RV \wedge RV_1 = [B, \ldots, I_k, \ldots, I_j].X \wedge RV_2 = [B, \ldots, I_l, \ldots, I_j].X \wedge I_k \neq I_l\}$$

The set of edges in $AGG_{\mathcal{M}B}$ is $E = RVE \cup IVE$, where

•  $RVE \subset RV \times RV$ is the set of edges between pairs of relational variables:

$$RVE = \{[B, \ldots, I_k].Y \to [B, \ldots, I_j].X \mid [I_j, \ldots, I_k].Y \to [I_j].X \in \mathcal{D} \wedge [B, \ldots, I_k] \in extend([B, \ldots, I_j], [I_j, \ldots, I_k])\}$$

•  $IVE \subset IV \times RV \cup RV \times IV$ is the set of edges inherited from both relational variables
involved in every intersection variable in $IV$:

$$IVE = \{\hat{Y} \to [B, \ldots, I_j].X \mid \hat{Y} = P_1.Y \cap P_2.Y \in IV \wedge (P_1.Y \to [B, \ldots, I_j].X \in RVE \vee P_2.Y \to [B, \ldots, I_j].X \in RVE)\}$$
$$\cup \; \{[B, \ldots, I_k].Y \to \hat{X} \mid \hat{X} = P_1.X \cap P_2.X \in IV \wedge ([B, \ldots, I_k].Y \to P_1.X \in RVE \vee [B, \ldots, I_k].Y \to P_2.X \in RVE)\}$$

The extend method is described in Definition 5.3 below. Essentially, the construction of an abstract ground graph for relational model structure $\mathcal{M}$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$ follows three simple steps: (1) Add a node for all relational variables from perspective $B$. (2) Insert edges for the direct causes of every relational variable by translating the dependencies in $\mathcal{D}$ using extend. (3) For each pair of potentially intersecting relational variables, create a new node that inherits the direct causes and effects from both participating relational variables in the intersection. Then, to answer queries of the form “Are X and Y d-separated by Z?” simply (1) augment X, Y, and Z with the intersection variables that they subsume and (2) apply the rules of d-separation on the abstract ground graph for the common perspective of X, Y, and Z. Since abstract ground graphs are defined from a specific perspective, every relational model produces a set of abstract ground graphs, one for each perspective in its underlying schema.


Example 5.1 Figure 5.1 shows the abstract ground graph $AGG_{\mathcal{M},\text{Employee}}$ for the organization example from the Employee perspective with hop threshold h = 6. As in Section 2, we derive an appropriate conditioning set $\mathbf{Z}$ in order to d-separate individual employee competence ($\mathbf{X}$ = {[Employee].Competence}) from the revenue of the employee’s funding business units ($\mathbf{Y}$ = {[Employee, Develops, Product, Funds, Business-Unit].Revenue}). Applying the rules of d-separation to the abstract ground graph, we see that it is necessary to condition on both product success ([Employee, Develops, Product].Success) and the competence of other employees developing the same products ([Employee, Develops, Product, Develops, Employee].Competence). For h = 6, augmenting $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ with their corresponding intersection variables does not result in any changes. For h = 8, the abstract ground graph includes a node for relational variable [Employee, Develops, Product, Develops, Employee, Develops, Product, Funds, Business-Unit].Revenue (the revenue of the business units funding the other products of collaborating employees) which, by Lemma 4.1, could have a non-empty intersection with [Employee, Develops, Product, Funds, Business-Unit].Revenue. Therefore, $\mathbf{Y}$ would also include the intersection with this other relational variable. However, for this query, the conditioning set $\mathbf{Z}$ for h = 6 happens to also d-separate at h = 8 (and any $h \in \mathbb{N}^0$). □


Using the algorithm devised by Geiger et al. (1990), relational d-separation queries can be answered in $O(|E|)$ time with respect to the number of edges in the abstract ground graph. In practice, the size of an abstract ground graph depends on the relational schema and model (e.g., the number of entity classes, the types of cardinalities, and the number of dependencies; see the experiment in Section 7.1), as well as the hop threshold limiting the length of relational paths. For the example in Figure 5.1, the abstract ground graph has 7 nodes and 7 edges (including 1 intersection node with 2 edges); for h = 8, it would have 13 nodes and 21 edges (including 4 intersection nodes with 13 edges). Abstract ground graphs are invariant to the size of ground graphs, even though ground graphs can be arbitrarily large (that is, relational databases have no maximum size).
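To make the query procedure concrete, the following is a minimal Python sketch of a d-separation check run on the h = 6 abstract ground graph of Example 5.1. Two things are assumptions rather than content of the paper. First, the edge list is our reconstruction from the dependencies mentioned in Example 5.1 and the rules of Definition 5.2; it was not read off the figure, although it agrees with the stated counts of 7 nodes and 7 edges. Second, the check uses the textbook ancestral moral graph criterion, not the linear-time algorithm of Geiger et al. (1990); the two give the same answers but this one is simpler to read. All Python names are hypothetical.

```python
from collections import deque

def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated by zs in the DAG given by `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for cause, effect in edges:
        parents.setdefault(cause, set())
        parents.setdefault(effect, set()).add(cause)

    # Step 1: keep only the query nodes and their ancestors.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moralize, linking each child to its parents and co-parents to each other.
    adjacent = {node: set() for node in relevant}
    for child in relevant:
        family = list(parents.get(child, ())) + [child]
        for a in family:
            for b in family:
                if a != b:
                    adjacent[a].add(b)

    # Step 3: search from xs without entering zs; reaching ys means d-connection.
    queue = deque(xs - zs)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False
        for nxt in adjacent[node] - zs - seen:
            seen.add(nxt)
            queue.append(nxt)
    return True

# Nodes of the h = 6 abstract ground graph (Employee perspective).
COMP = "[Employee].Competence"
SUCC = "[Employee, Develops, Product].Success"
REV = "[Employee, Develops, Product, Funds, Business-Unit].Revenue"
PEER_COMP = "[Employee, Develops, Product, Develops, Employee].Competence"
PEER_SUCC = "[Employee, Develops, Product, Develops, Employee, Develops, Product].Success"
SIB_SUCC = "[Employee, Develops, Product, Funds, Business-Unit, Funds, Product].Success"
INTERSECTION = PEER_SUCC + " ∩ " + SIB_SUCC

# Assumed edge list (reconstructed, see the lead-in).
EDGES = [
    (COMP, SUCC), (PEER_COMP, SUCC), (SUCC, REV), (SIB_SUCC, REV),
    (PEER_COMP, PEER_SUCC), (PEER_COMP, INTERSECTION), (INTERSECTION, REV),
]

needs_both = d_separated(EDGES, {COMP}, {REV}, {SUCC, PEER_COMP})
success_only = d_separated(EDGES, {COMP}, {REV}, {SUCC})
```

Under these assumptions, conditioning on product success alone leaves a d-connecting path through the competence of collaborating employees and the intersection node, which matches the conclusion of Example 5.1 that both variables must be in the conditioning set.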

Next,  we  formally  define  the  extend  method,  which  is  used  internally  for  the  construction 
of  abstract  ground  graphs.  This  method  translates  dependencies  specified  in  the  model  into 
dependencies  in  the  abstract  ground  graph. 


6. In theory, abstract ground graphs can have an infinite number of nodes as relational paths may have no bound. In practice, a hop threshold $h \in \mathbb{N}^0$ is enforced to limit the length of these paths. Hops are defined as the number of times the path “hops” between item classes in the schema, or one less than the length of the path.

7. The variables Salary and Budget are removed for simplicity. They are irrelevant for this d-separation example as they are solely effects of other variables.

[Figure: a directed graph over seven nodes, one of which is the intersection node [Employee, Develops, Product, Develops, Employee, Develops, Product].Success ∩ [Employee, Develops, Product, Funds, Business-Unit, Funds, Product].Success.]

Figure 5.1: The abstract ground graph for the organization domain model in Figure 2.2 from the Employee perspective with hop threshold h = 6 (with the variables for Salary and Budget omitted for simplicity). This abstract ground graph includes one intersection node.


Definition 5.3 (Extending relational paths) Let $P_{orig}$ and $P_{ext}$ be two relational paths for schema $\mathcal{S}$. The following three functions extend $P_{orig}$ with $P_{ext}$:

$$\mathit{extend}(P_{orig}, P_{ext}) = \{ P = P_{orig}^{1,\,n_o - i + 1} + P_{ext}^{i+1,\,n_e} \mid i \in \mathit{pivots}(\mathit{reverse}(P_{orig}), P_{ext}) \;\wedge\; \mathit{isValid}(P) \}$$

$$\mathit{pivots}(P_1, P_2) = \{ i \mid P_1^{1,i} = P_2^{1,i} \}$$

$$\mathit{isValid}(P) = \begin{cases} \text{True} & \text{if } P \text{ does not violate Definition 4.3} \\ \text{False} & \text{otherwise} \end{cases}$$

where $n_o$ is the length of $P_{orig}$, $n_e$ is the length of $P_{ext}$, $P^{i,j}$ corresponds to 1-based i-inclusive, j-inclusive subpath indexing, + is concatenation of paths, and reverse is a method that reverses the order of the path.


The extend method constructs a set of valid relational paths from two input relational paths. It first finds the indices (called pivots) of the item classes for which the input paths ($\mathit{reverse}(P_{orig})$ and $P_{ext}$) have a common starting subpath. Then, it concatenates the two input paths at each pivot, removing one of the duplicated subpaths (see Example 5.2). Since d-separation requires blocking all paths of dependence between two sets of variables, the extend method is critical to ensure the soundness and completeness of our approach. The abstract ground graph must capture all paths of dependence among the random variables in the relational variable instances for all represented ground graphs. However, relational model structures are specified by relational dependencies, each from a given perspective and with outcomes that have singleton relational paths. The extend method is called repeatedly during the creation of an abstract ground graph, with $P_{orig}$ set to some relational path and $P_{ext}$ drawn from the relational path of the treatment in some relational dependency.

Example 5.2 During the construction of the abstract ground graph $AGG_{\mathcal{M},\text{Employee}}$ depicted in Figure 5.1, the extend method is called several times. First, all relational variables from the Employee perspective are added as nodes in the graph. Next, extend is used to insert edges corresponding to direct causes. Consider the node for [Employee, Develops, Product].Success. The construction of $AGG_{\mathcal{M},\text{Employee}}$ calls $\mathit{extend}(P_{orig}, P_{ext})$ with $P_{orig}$ = [Employee, Develops, Product] and $P_{ext}$ = [Product, Develops, Employee] because [Product, Develops, Employee].Competence → [Product].Success $\in \mathcal{D}$. Here, $\mathit{extend}(P_{orig}, P_{ext})$ = {[Employee], [Employee, Develops, Product, Develops, Employee]}, which leads to the insertion of two edges in the abstract ground graph. Note that $\mathit{pivots}(\mathit{reverse}(P_{orig}), P_{ext})$ = {1, 2, 3}, and the pivot at i = 2 yields the invalid relational path [Employee, Develops, Employee]. □
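The extend and pivots functions of Definition 5.3 translate almost directly into code. The sketch below is a minimal Python rendering that reproduces the call from Example 5.2. Definition 4.3 is not restated in this section, so the validity check is passed in as a parameter, and `is_valid_simplified` is a hypothetical stand-in that only rejects a relationship class flanked by the same item class on both sides (enough to rule out [Employee, Develops, Employee]). It should not be read as the full definition of a valid relational path.

```python
RELATIONSHIP_CLASSES = {"Develops", "Funds"}  # organization example

def pivots(p1, p2):
    """1-based indices i at which the length-i prefixes of p1 and p2 agree."""
    shortest = min(len(p1), len(p2))
    return [i for i in range(1, shortest + 1) if p1[:i] == p2[:i]]

def extend(p_orig, p_ext, is_valid):
    """All valid paths obtained by joining p_orig and p_ext at each pivot."""
    n_o = len(p_orig)
    paths = []
    for i in pivots(p_orig[::-1], p_ext):
        # P_orig^{1, n_o - i + 1} + P_ext^{i + 1, n_e}, using 0-based slices.
        candidate = p_orig[: n_o - i + 1] + p_ext[i:]
        if is_valid(candidate):
            paths.append(candidate)
    return paths

def is_valid_simplified(path):
    """Hypothetical, partial validity check (NOT the full Definition 4.3)."""
    for left, middle, right in zip(path, path[1:], path[2:]):
        if middle in RELATIONSHIP_CLASSES and left == right:
            return False
    return True

P_ORIG = ["Employee", "Develops", "Product"]
P_EXT = ["Product", "Develops", "Employee"]
extended = extend(P_ORIG, P_EXT, is_valid_simplified)
```

Here the three pivots produce the candidates [Employee, Develops, Product, Develops, Employee], [Employee, Develops, Employee], and [Employee]; the middle one is discarded by the validity check, leaving the two paths reported in Example 5.2.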

We  also  describe  two  important  properties  of  the  extend  method  with  the  following  two 
lemmas.  The  first  lemma  states  that  every  relational  path  produced  by  extend  yields  a 
terminal  set  for  some  skeleton  such  that  there  is  an  item  instance  also  reachable  by  the  two 
original  paths.  This  lemma  is  useful  for  proving  the  soundness  of  our  abstraction:  All  edges 
inserted  in  an  abstract  ground  graph  correspond  to  edges  in  some  ground  graph. 


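Lemma 5.1 Let $P_{orig} = [I_1, \ldots, I_j]$ and $P_{ext} = [I_j, \ldots, I_k]$ be two relational paths with $\mathbf{P} = \mathit{extend}(P_{orig}, P_{ext})$. Then, $\forall P \in \mathbf{P}$ there exists a relational skeleton $\sigma \in \Sigma_{\mathcal{S}}$ such that $\exists i_1 \in \sigma(I_1)$ such that $\exists i_k \in P|_{i_1}$ and $\exists i_j \in P_{orig}|_{i_1}$ such that $i_k \in P_{ext}|_{i_j}$.

Proof. See Appendix A.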


Example 5.3 Let $\sigma$ be the skeleton shown in Figure 2.1(b), let $P_{orig}$ = [Employee, Develops, Product], let $P_{ext}$ = [Product, Develops, Employee], and let $i_1$ = Sally $\in \sigma$(Employee). From Example 5.2, we know that $\mathbf{P} = \mathit{extend}(P_{orig}, P_{ext})$ = {[Employee], [Employee, Develops, Product, Develops, Employee]}. We also have [Employee]$|_{\text{Sally}}$ = {Sally} and [Employee, Develops, Product, Develops, Employee]$|_{\text{Sally}}$ = {Quinn, Roger, Thomas}. By Lemma 5.1, there should exist an $i_j \in P_{orig}|_{i_1}$ such that Sally and at least one of Quinn, Roger, and Thomas are in the terminal set $P_{ext}|_{i_j}$. We have $P_{orig}|_{\text{Sally}}$ = {Laptop, Tablet}, and $P_{ext}|_{\text{Laptop}}$ = {Quinn, Roger, Sally} and $P_{ext}|_{\text{Tablet}}$ = {Sally, Thomas}. So, the lemma clearly holds for this example. □
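The terminal sets quoted in Example 5.3 can be recomputed with a short traversal. The following Python sketch is illustrative only and rests on two assumptions. The skeleton is restricted to the Develops links that Example 5.3 itself implies (who develops Laptop and Tablet); the remainder of Figure 2.1(b) is not encoded. The traversal also assumes that a terminal set excludes any item already reached by a shorter prefix of the same path, which is consistent with Sally being absent from [Employee, Develops, Product, Develops, Employee]$|_{\text{Sally}}$. All identifiers are hypothetical.

```python
# Fragment of the skeleton implied by Example 5.3: (employee, product) links.
DEVELOPS = [
    ("Quinn", "Laptop"), ("Roger", "Laptop"), ("Sally", "Laptop"),
    ("Sally", "Tablet"), ("Thomas", "Tablet"),
]

item_class = {}   # item instance -> item class name
neighbors = {}    # item instance -> directly linked item instances
for employee, product in DEVELOPS:
    link = (employee, product)  # the Develops relationship instance
    item_class[employee] = "Employee"
    item_class[product] = "Product"
    item_class[link] = "Develops"
    for a, b in ((employee, link), (link, product)):
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

def terminal_set(path, start):
    """Items reached by walking `path` from `start`, skipping items that a
    shorter prefix of the path has already reached (assumed semantics)."""
    reached = {start}
    frontier = {start}
    for cls in path[1:]:
        frontier = {
            nxt
            for item in frontier
            for nxt in neighbors[item]
            if item_class[nxt] == cls and nxt not in reached
        }
        reached |= frontier
    return frontier

collaborators = terminal_set(
    ["Employee", "Develops", "Product", "Develops", "Employee"], "Sally"
)
products = terminal_set(["Employee", "Develops", "Product"], "Sally")
```

With this fragment, the witnesses required by Lemma 5.1 can be checked directly: every collaborator of Sally appears in the terminal set of [Product, Develops, Employee] for at least one of her products.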


Lemma 5.1 guarantees that, for some relational skeleton, there exists an item instance in the terminal sets produced by extend that also appears in the terminal set of $P_{ext}$ via some instance in the terminal set of $P_{orig}$. It is also possible (although infrequent) that there exist items reachable by $P_{orig}$ and $P_{ext}$ that are not in the terminal set of any path produced with $\mathit{extend}(P_{orig}, P_{ext})$. The following lemma describes this unreachable set of items, stating that there must exist an alternative relational path $P'_{orig}$ that intersects with $P_{orig}$ and, when using extend, catches those remaining items. This lemma is important for proving the completeness of our abstraction: All edges in all ground graphs are represented in the abstract ground graph.


Lemma 5.2 Let $\sigma \in \Sigma_{\mathcal{S}}$ be a relational skeleton, and let $P_{orig} = [I_1, \ldots, I_j]$ and $P_{ext} = [I_j, \ldots, I_k]$ be two relational paths with $\mathbf{P} = \mathit{extend}(P_{orig}, P_{ext})$. Then, $\forall i_1 \in \sigma(I_1)$ $\forall i_j \in P_{orig}|_{i_1}$ $\forall i_k \in P_{ext}|_{i_j}$, if $\forall P \in \mathbf{P}$ $i_k \notin P|_{i_1}$, then $\exists P'_{orig}$ such that $P_{orig}|_{i_1} \cap P'_{orig}|_{i_1} \neq \emptyset$ and $i_k \in P'|_{i_1}$ for some $P' \in \mathit{extend}(P'_{orig}, P_{ext})$.

Proof. See Appendix A.


Example 5.4 Although Lemma 5.2 does not apply to the organization domain as currently represented, it could apply if either (1) there were cycles in the relational schema or (2) the path specifications on the relational dependencies included a cycle. Consider additional relationships between employees and products. If employees could be involved with products at various stages (e.g., research, development, testing, marketing), then there would be alternative relational paths for which the lemma might apply. The proof of the lemma in Appendix A provides abstract conditions describing when the lemma applies. □


5.2  Proof  of  Correctness 

The  correctness  of  our  approach  to  relational  d-separation  relies  on  several  facts:  (1)  d- 
separation  is  valid  for  directed  acyclic  graphs;  (2)  ground  graphs  are  directed  acyclic  graphs; 
and  (3)  abstract  ground  graphs  are  directed  acyclic  graphs  that  represent  exactly  the  edges 
that could appear in all possible ground graphs. It follows that d-separation on abstract ground graphs, augmented by intersection variables, is sound and complete for all ground graphs (see footnote 8). Additionally, we show that since relational d-separation is sound and complete,
it  is  also  equivalent  to  the  Markov  condition  for  relational  models.  Using  the  previous 
definitions  and  lemmas,  the  following  sequence  of  results  proves  the  correctness  of  our 
approach  to  identifying  independence  in  relational  models. 


Theorem  5.1  The  rules  of  d-separation  are  sound  and  complete  for  directed  acyclic  graphs. 


Proof. Due to Verma and Pearl (1988) for soundness and Geiger and Pearl (1988) for
completeness.  ■ 


Theorem 5.1 implies that (1) all conditional independence facts derived by d-separation on a Bayesian network structure hold in any distribution represented by that model (soundness) and (2) all conditional independence facts that hold in a faithful distribution can be inferred from d-separation applied to the Bayesian network that encodes the distribution (completeness).


Lemma 5.3 For every acyclic relational model structure $\mathcal{M}$ and skeleton $\sigma \in \Sigma_{\mathcal{S}}$, the ground graph $GG_{\mathcal{M}\sigma}$ is a directed acyclic graph.


Proof.  Due  to  both  Heckerman  et  al.  (2007)  for  DAPER  models  and  Getoor  (2001)  for 
PRMs.  ■ 


By Theorem 5.1 and Lemma 5.3, d-separation is sound and complete when applied to a ground graph. However, Definition 5.1 states that relational d-separation must hold across all possible ground graphs, which is the reason for constructing the abstract ground graph representation.

8. In Appendix B, we provide proofs of soundness and completeness for abstract ground graphs and relational d-separation that are limited by practical hop threshold bounds.

Theorem 5.2 For every acyclic relational model structure $\mathcal{M}$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$, the abstract ground graph $AGG_{\mathcal{M}B}$ is sound and complete for all ground graphs $GG_{\mathcal{M}\sigma}$ with skeleton $\sigma \in \Sigma_{\mathcal{S}}$.

Proof. See Appendix A.

Theorem 5.2 guarantees that, for a given perspective, an abstract ground graph captures
all  possible  paths  of  dependence  between  any  pair  of  variables  in  any  ground  graph.  The 
details  of  the  proof  provide  the  reasons  why  explicitly  representing  intersection  variables  is 
necessary  for  ensuring  a  sound  and  complete  abstraction. 

Theorem 5.3 For every acyclic relational model structure $\mathcal{M}$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$, the abstract ground graph $AGG_{\mathcal{M}B}$ is directed and acyclic.

Proof. See Appendix A.

Theorem 5.3 ensures that the standard rules of d-separation can apply directly to abstract ground graphs because they are acyclic given an acyclic model. We now have sufficient supporting theory to prove that d-separation on abstract ground graphs is sound and complete. In the following theorem, we define $\bar{\mathbf{W}}$ as the set of nodes augmented with their corresponding intersection nodes for the set of relational variables $\mathbf{W}$: $\bar{\mathbf{W}} = \mathbf{W} \cup \bigcup_{W \in \mathbf{W}} \{ W \cap W' \mid W \cap W' \text{ is an intersection node in } AGG_{\mathcal{M}B} \}$.
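The augmentation of a variable set with its intersection nodes is a one-line set operation. The Python sketch below is a minimal illustration in which each intersection node is represented as a `frozenset` of the two relational variables it intersects; that representation and the function name are our own choices, not the paper's. The example pair is the one mentioned in Example 5.1 for h = 8.

```python
def augment(variables, intersection_nodes):
    """Return `variables` plus every intersection node that involves one of them."""
    variables = set(variables)
    return variables | {pair for pair in intersection_nodes if pair & variables}

REVENUE = "[Employee, Develops, Product, Funds, Business-Unit].Revenue"
OTHER_REVENUE = (
    "[Employee, Develops, Product, Develops, Employee, "
    "Develops, Product, Funds, Business-Unit].Revenue"
)
COMPETENCE = "[Employee].Competence"

# Intersection node named in Example 5.1 for hop threshold h = 8.
INTERSECTIONS = {frozenset({REVENUE, OTHER_REVENUE})}

augmented_y = augment({REVENUE}, INTERSECTIONS)
augmented_x = augment({COMPETENCE}, INTERSECTIONS)
```

In this sketch the set containing the revenue variable gains one intersection node, while the set containing only employee competence is unchanged.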

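Theorem 5.4 Relational d-separation is sound and complete for abstract ground graphs. Let $\mathcal{M}$ be an acyclic relational model structure, and let $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ be three distinct sets of relational variables for perspective $B \in \mathcal{E} \cup \mathcal{R}$ defined over relational schema $\mathcal{S}$. Then, $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ are d-separated by $\bar{\mathbf{Z}}$ on the abstract ground graph $AGG_{\mathcal{M}B}$ if and only if for all skeletons $\sigma \in \Sigma_{\mathcal{S}}$ and for all $b \in \sigma(B)$, $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$ in ground graph $GG_{\mathcal{M}\sigma}$.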


Proof. We must show that d-separation on an abstract ground graph implies d-separation on all ground graphs it represents (soundness) and that d-separation facts that hold across all ground graphs are also entailed by d-separation on the abstract ground graph (completeness).

Soundness: Assume that $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ are d-separated by $\bar{\mathbf{Z}}$ on $AGG_{\mathcal{M}B}$. Assume for contradiction that there exists an item instance $b \in \sigma(B)$ such that $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are not d-separated by $\mathbf{Z}|_b$ in the ground graph $GG_{\mathcal{M}\sigma}$ for some arbitrary skeleton $\sigma$. Then, there must exist a d-connecting path $p$ from some $x \in \mathbf{X}|_b$ to some $y \in \mathbf{Y}|_b$ given all $z \in \mathbf{Z}|_b$. By Theorem 5.2, $AGG_{\mathcal{M}B}$ is complete, so all edges in $GG_{\mathcal{M}\sigma}$ are captured by edges in $AGG_{\mathcal{M}B}$. So, path $p$ must be represented from some node in $\{N_x \mid x \in N_x|_b\}$ to some node in $\{N_y \mid y \in N_y|_b\}$, where $N_x$, $N_y$ are nodes in $AGG_{\mathcal{M}B}$. If $p$ is d-connecting in $GG_{\mathcal{M}\sigma}$, then it is d-connecting in $AGG_{\mathcal{M}B}$, implying that $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ are not d-separated by $\bar{\mathbf{Z}}$. So, $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ must be d-separated by $\mathbf{Z}|_b$.

Completeness: Assume that $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$ in the ground graph $GG_{\mathcal{M}\sigma}$ for all skeletons $\sigma$ for all $b \in \sigma(B)$. Assume for contradiction that $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ are not d-separated by $\bar{\mathbf{Z}}$ on $AGG_{\mathcal{M}B}$. Then, there must exist a d-connecting path $p$ for some relational variable $X \in \bar{\mathbf{X}}$ to some $Y \in \bar{\mathbf{Y}}$ given all $Z \in \bar{\mathbf{Z}}$. By Theorem 5.2, $AGG_{\mathcal{M}B}$ is sound, so every edge in $AGG_{\mathcal{M}B}$ must correspond to some pair of variables in some ground graph. So, if $p$ is d-connecting in $AGG_{\mathcal{M}B}$, then there must exist some skeleton $\sigma$ such that $p$ is d-connecting in $GG_{\mathcal{M}\sigma}$ for some $b \in \sigma(B)$, implying that d-separation does not hold for that ground graph. So, $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ must be d-separated by $\bar{\mathbf{Z}}$ on $AGG_{\mathcal{M}B}$. ■


Theorem 5.4 proves that d-separation on abstract ground graphs is a sound and complete solution to identifying independence in relational models. Theorem 5.1 also implies that the set of conditional independence facts derived from abstract ground graphs is exactly the same as the set of conditional independencies that all distributions represented by all possible ground graphs have in common.


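Corollary 5.1 $\bar{\mathbf{X}}$ and $\bar{\mathbf{Y}}$ are d-connected given $\bar{\mathbf{Z}}$ on the abstract ground graph $AGG_{\mathcal{M}B}$ if and only if there exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and an item instance $b \in \sigma(B)$ such that $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-connected given $\mathbf{Z}|_b$ in ground graph $GG_{\mathcal{M}\sigma}$.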


Corollary 5.1 is logically equivalent to Theorem 5.4. Although it is a simple restatement of the previous theorem, it emphasizes that relational d-separation claims d-connection if and only if there exists a ground graph for which $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-connected given $\mathbf{Z}|_b$. This implies that there may be some ground graphs for which $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$, but the abstract ground graph still claims d-connection. This may happen if the relational skeleton does not enable certain underlying relational connections. For example, if the relational skeleton in Figure 2.1(b) included only products that were developed by a single employee, then there would be no relationally d-connecting path in the example in Section 2. If this is a fundamental property of the domain (e.g., there are products developed by a single employee and products developed by multiple employees), then revising the underlying schema to include two different classes of products would yield a more accurate model implying a larger set of conditional independencies.

Additionally, we can show that relational d-separation is equivalent to the Markov condition on relational models.


Definition 5.4 (Relational Markov condition) Let $X$ be a relational variable for perspective $B \in \mathcal{E} \cup \mathcal{R}$ defined over relational schema $\mathcal{S}$. Let $nd(X)$ be the non-descendant variables of $X$, and let $pa(X)$ be the set of parent variables of $X$. Then, for relational model $\mathcal{M}_\Theta$, $P(X \mid nd(X), pa(X)) = P(X \mid pa(X))$ if and only if $\forall x \in X|_b$ $P(x \mid nd(x), pa(x)) = P(x \mid pa(x))$ in parameterized ground graph $GG_{\mathcal{M}_\Theta \sigma}$ for all skeletons $\sigma \in \Sigma_{\mathcal{S}}$ and for all $b \in \sigma(B)$.


In other words, a relational variable $X$ is independent of its non-descendants given its parents if and only if, for all possible parameterized ground graphs, the Markov condition holds for all instances of $X$. For Bayesian networks, the Markov condition is equivalent to d-separation. Since relational d-separation is sound and complete (Theorem 5.4), it is equivalent to the relational Markov condition.
6. Naive Relational d-Separation Is Frequently Incorrect


If  the  rules  of  d-separation  for  Bayesian  networks  were  applied  directly  to  the  structure 
of  relational  models,  how  frequently  would  the  conditional  independence  conclusions  be 
correct?  In  this  section,  we  evaluate  the  necessity  of  our  approach — relational  d-separation 
executed  on  abstract  ground  graphs.  We  empirically  compare  the  consistency  of  a  naive 
approach  against  our  sound  and  complete  solution  over  a  large  space  of  synthetic  causal 
models.  To  promote  a  fair  comparison,  we  restrict  the  space  of  relational  models  to  those 
with  underlying  dependencies  that  could  feasibly  be  represented  and  recovered  by  a  naive 
approach.  We  describe  this  space  of  models,  present  a  reasonable  approach  for  applying 
traditional  d-separation  to  the  structure  of  relational  models,  and  quantify  its  decrease  in 
expressive  power  and  accuracy. 

Consider the following limited definition of relational paths, which itself limits the space of models and conditional independence queries. A simple relational path $P = [I_j, \ldots, I_k]$ for relational schema $\mathcal{S}$ is a relational path such that $I_j \neq \cdots \neq I_k$. The sole difference between relational paths (Definition 4.3) and simple relational paths is that no item class may appear more than once along the latter. This yields paths drawn directly from a schema diagram. For the example in Figure 2.1(a), [Employee, Develops, Product] is simple whereas [Employee, Develops, Product, Develops, Employee] is not.
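The distinguishing condition for simple relational paths (no repeated item class) is easy to state in code. The short Python sketch below checks only that condition; it assumes its input is already a valid relational path and does not re-check Definition 4.3. The function name is hypothetical.

```python
def is_simple_path(path):
    """True if no item class appears more than once along the path."""
    return len(set(path)) == len(path)

short_path = ["Employee", "Develops", "Product"]
round_trip = ["Employee", "Develops", "Product", "Develops", "Employee"]
results = (is_simple_path(short_path), is_simple_path(round_trip))
```

The two inputs are the paths discussed in the text: the first is simple, and the second is not because both Employee and Develops repeat.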

Additionally, we define simple relational schemas such that, for any two item classes $I_j, I_k \in \mathcal{E} \cup \mathcal{R}$, there exists at most one simple relational path between them (i.e., no cycles occur in the schema diagram). The example in Figure 2.1(a) is a simple relational schema. The restriction to simple relational paths and schemas yields similar definitions for simple relational variables, simple relational dependencies, and simple relational models. The relational model in Figure 2.2(a) is simple because it includes only simple relational dependencies.

A first approximation to relational d-separation would be to apply the rules of traditional d-separation directly to the graphical representation of relational models. This is equivalent to applying d-separation to the class dependency graph $G_{\mathcal{M}}$ (see Definition 4.10) of relational model $\mathcal{M}$. The class dependency graph for the model in Figure 2.2(a) is shown in Figure 6.1(a). Note that the class dependency graph ignores path designators on dependencies, does not include all implications of dependencies among arbitrary relational variables, and does not represent intersection variables.

Although the class dependency graph is independent of perspectives, testing any conditional independence fact requires choosing a perspective. All relational variables must have a common base item class; otherwise, no method can produce a single consistent, propositional table from a relational database. For example, consider the construction of a table describing employees with columns for their salary, the success of products they develop, and the revenue of the business units they operate under. This procedure requires joining the instances of three relational variables ([Employee].Salary, [Employee, Develops, Product].Success, and [Employee, Develops, Product, Funds, Business-Unit].Revenue) for every common base item instance, from Paul to Thomas. See, for example, the resulting propositional table for these relational variables and an example query in Table D.1 and Figure D.2, respectively. An individual relational variable requires joining the item classes within its relational path, but combining a collection of relational variables requires joining on their common base item class. Fortunately, given a perspective and the space of simple relational schemas and models, a class dependency graph is equivalent to a simple abstract ground graph.

Figure 6.1: For the model in Figure 2.2(a), (a) the class dependency graph and (b) three simple abstract ground graphs for the Employee, Product, and Business-Unit perspectives.


Definition 6.1 (Simple abstract ground graph) For simple relational model $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$, the simple abstract ground graph $AGG^{s}_{\mathcal{M}B}$ is the directed acyclic graph $(V, E)$ that abstracts the dependencies $\mathcal{D}$ among simple relational variables. The nodes consist of simple relational variables $\{ [B, \ldots, I_j].X \mid B \neq \cdots \neq I_j \}$, and the edges connect those nodes $\{ [B, \ldots, I_k].Y \rightarrow [B, \ldots, I_j].X \mid [I_j, \ldots, I_k].Y \rightarrow [I_j].X \in \mathcal{D} \;\wedge\; [B, \ldots, I_k] \in \mathit{extend}([B, \ldots, I_j], [I_j, \ldots, I_k]) \;\wedge\; [B, \ldots, I_k].Y, [B, \ldots, I_j].X \in V \}$.


Simple  abstract  ground  graphs  only  include  nodes  for  simple  relational  variables  and 
necessarily  exclude  intersection  variables.  Lemma  4.1 — which  characterizes  the  intersection 
between  a  pair  of  relational  paths — only  applies  to  pairs  of  simple  relational  paths  if  the 
schema  contains  cycles,  which  is  not  the  case  for  simple  relational  schemas  by  definition. 
As  a  result,  the  simple  abstract  ground  graph  for  a  given  schema  and  model  contains  the 
same  number  of  nodes  and  edges,  regardless  of  perspective;  the  nodes  simply  have  path 
designators redefined from the given perspective. Figure 6.1(b) shows three simple abstract ground graphs from distinct perspectives for the model in Figure 2.2(a). As noted above, simple abstract ground graphs are qualitatively the same as the class dependency graph, but they enable answering relational d-separation queries, which requires a common perspective in order to be semantically meaningful.

The naive approach to relational d-separation derives conditional independence facts from simple abstract ground graphs (Definition 6.1). The sound and complete approach described in this paper applies d-separation, with input variable sets augmented by their intersection variables, to “regular” abstract ground graphs, as described by Definition 5.2. Clearly, if d-separation on a simple abstract ground graph claims that X is d-separated from Y given Z, then d-separation on the regular abstract ground graph yields the same conclusion if and only if there are no d-connecting paths in the regular abstract ground graph. Either X and Y can be d-separated by a set of simple relational variables Z, or they require non-simple relational variables, that is, those involving relational paths with repeated item classes (see footnote 9).

To evaluate the necessity of regular abstract ground graphs (i.e., the additional paths involving non-simple relational variables and intersection variables), we compared the frequency of equivalence between the conclusions of d-separation on simple and regular abstract ground graphs. The two approaches are only equivalent if a minimal separating set consists entirely of simple relational variables (see footnote 10).


Thus,  for  an  arbitrary  pair  of  relational  variables  X  and  Y  with  a  common  perspective, 
we  test  the  following  on  regular  abstract  ground  graphs: 


1.  Is  either  X  or  Y  a  non-simple  relational  variable? 

2.  Are  X  and  Y  marginally  independent? 

3.  Does  a  minimal  separating  set  Z  d-separate  X  and  Y,  where  Z  consists  solely  of  simple 
relational  variables? 

4.  Is  there  any  separating  set  Z  that  d-separates  X  and  Y? 

If  the  answer  to  (1)  is  yes,  then  the  naive  approach  cannot  apply  since  either  X  or  Y 
is  undefined  for  the  simple  abstract  ground  graph.  If  the  answer  to  (2)  is  yes,  then  there 
is equivalence; this is a trivial case because there are no d-connecting paths for Z = ∅. If
the  answer  to  (3)  is  yes,  then  there  is  a  minimal  separating  set  Z  consisting  of  only  simple 
relational  variables.  In  this  case,  the  simple  abstract  ground  graph  is  sufficient,  and  we  also 
have  equivalence.  If  the  answer  to  (4)  is  no,  then  no  separating  set  Z,  simple  or  otherwise, 
renders  X  and  Y  conditionally  independent. 
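The four tests above can be read as a small decision procedure. The following is a minimal sketch, assuming the answers to tests (1) through (4) have already been computed on the regular abstract ground graph; the function name and the outcome labels are hypothetical and do not come from the paper.

```python
def classify_pair(is_nonsimple, marginally_independent,
                  simple_minimal_sepset, any_sepset):
    """Map the yes/no answers to tests (1)-(4) onto an outcome label.

    All four arguments are booleans. The labels are illustrative names only.
    """
    if is_nonsimple:
        # Test (1): X or Y is undefined in the simple abstract ground graph.
        return "not_representable"
    if marginally_independent:
        # Test (2): the empty set separates X and Y, so both approaches agree.
        return "equivalent_marginal"
    if simple_minimal_sepset:
        # Test (3): a minimal separating set uses only simple variables.
        return "equivalent_conditional"
    if not any_sepset:
        # Test (4) answered no: nothing separates X and Y.
        return "dependent"
    # A separating set exists, but it needs non-simple relational variables.
    return "not_equivalent"
```

Under this reading, only the last label marks a query for which the naive approach and the regular abstract ground graph disagree.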

We  randomly  generated  simple  relational  schemas  and  models  for  100  trials  for  each 
setting  using  the  following  parameters: 


•  Number  of  entity  classes,  ranging  from  1  to  4. 

•  Number  of  relationship  classes,  fixed  at  one  less  than  the  number  of  entities,  ensuring 
simple,  connected  relational  schemas.  Relationship  cardinalities  are  chosen  uniformly 
at  random. 

•  Number of attributes for each entity and relationship class, randomly drawn from a
shifted Poisson distribution with λ = 1.0 (∼ Pois(1.0) + 1).

•  Number  of  dependencies  in  the  model,  ranging  from  1  to  10. 
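The schema settings listed above can be sketched as a small generator. This is an illustration only, not the generator used for the experiments: attaching each new entity to a randomly chosen earlier entity (to obtain a connected schema with one fewer relationship than entities), the ONE/MANY labels, and all function and key names are assumptions.

```python
import math
import random


def shifted_poisson(rng, lam=1.0):
    """Draw 1 + Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k + 1


def random_schema(num_entities, rng):
    """Build a connected schema with num_entities - 1 binary relationships."""
    entities = [f"E{i}" for i in range(1, num_entities + 1)]
    relationships = []
    for i in range(1, num_entities):
        other = rng.randrange(i)  # link to an earlier entity, giving a tree
        relationships.append({
            "name": f"R{i}",
            "entities": (entities[other], entities[i]),
            # One cardinality per participating entity, chosen uniformly.
            "cardinalities": (rng.choice(["ONE", "MANY"]),
                              rng.choice(["ONE", "MANY"])),
        })
    classes = entities + [r["name"] for r in relationships]
    attributes = {name: shifted_poisson(rng) for name in classes}
    return {"entities": entities,
            "relationships": relationships,
            "attributes": attributes}
```

The shift by one guarantees that every entity and relationship class has at least one attribute.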


9.  The  theoretical  conditions  under  which  equivalence  occurs  are  sufficiently  complex  that  they  provide 
little  utility  as  they  essentially  require  reconstructing  the  regular  abstract  ground  graph  and  checking  a 
potentially  exponential  number  of  dependency  paths. 

10.  If X and Y are d-separated given Z, then Z is a separating set for X and Y. A separating set Z is
minimal if there is no proper subset of Z that is also a separating set.

Figure 6.2: The majority (56%) of generated relational d-separation queries are not
representable with the naive approach. Of the 44% that are representable (involving only simple
relational variables), 82% are marginally independent and 9% are dependent. Pairs of
relational variables in the remaining 9% are conditionally independent given a non-empty
separating set (X ⊥⊥ Y | Z, where Z ≠ ∅). We test whether the conditioning set consists
solely of simple relational variables. If so, then the naive approach to relational d-separation
is equivalent to d-separation on fully specified abstract ground graphs. This graph plots the
frequency of equivalence across schemas with increasing numbers of entity classes (1-4) for
varying numbers of dependencies (1-10). For schemas with more than one entity class, the
frequency of equivalence decreases as the number of dependencies increases. Shown with
95% confidence intervals.


Then, for all pairs of relational variables with a common perspective limited by a hop
threshold of h = 4, we ran the aforementioned tests against the regular abstract ground
graph, limiting its relational variables by a hop threshold of h = 8 (a sufficient hop threshold
for soundness and completeness; see Appendix E).

This procedure generated a total of almost 3.6 million pairs of relational variables to test.
Approximately 56% included a non-simple relational variable; the naive approach cannot
be used to derive a conditional independence statement in these cases, requiring the full
abstract ground graph in order to represent these variables. Of the remaining 44% (roughly
1.6 million), 82% were marginally independent, and 9% were not conditionally independent
given any conditioning set Z. Then, of the remaining 9% (roughly 145,000), we can test
the frequency of equivalence for conditional independence facts with non-empty separating
sets, that is, the proportion of cases for which only simple relational variables are required
in a minimal separating set Z.

Figure  6.2  shows  this  frequency  for  schemas  of  increasing  numbers  of  entity  classes 
(1-4)  for  varying  numbers  of  dependencies  in  the  causal  model  (1-10).  Since  relational 
schemas  with  a  single  entity  class  and  no  relationships  describe  propositional  data,  the 
simple  abstract  ground  graph  is  equivalent  to  the  full  abstract  ground  graph,  which  is  also 
equivalent  to  the  model  itself.  In  this  case,  the  naive  approach  is  always  equivalent  because  it 
is  exactly  d-separation  on  Bayesian  networks.  For  truly  relational  schemas  (with  more  than 
one  entity  class  and  at  least  one  relationship  class),  the  frequency  of  equivalence  decreases 
as  the  number  of  dependencies  in  the  model  increases.  Additionally,  the  frequency  of 
equivalence  decreases  more  as  the  number  of  entities  in  the  schema  increases.  For  example, 
the  frequency  of  equivalence  for  nine  dependencies  is  60.3%  for  two  entities,  51.2%  for  three 
entities,  and  43.2%  for  four  entities. 

We also learned statistical models that predict the number of equivalent and
non-equivalent statements in order to identify key factors that affect the frequency of equivalence.
We found that the number of dependencies and size of the relational model (regulated by
the number of entities and MANY cardinalities) dictate the equivalence. As a relational
model deviates from a Bayesian network, we should expect more d-connecting paths in the
regular but not simple abstract ground graph. This property also depends on the specific
combination of dependencies in the model. Appendix F presents details of this analysis.

This experiment suggests that applying traditional d-separation directly to a relational
model structure will frequently derive incorrect conditional independence facts.
Additionally, there is a large class of conditional independence queries involving non-simple variables
for which such an approach is undefined. These results indicate that fully specifying
abstract ground graphs and applying d-separation augmented with intersection variables (as
described in Section 5) is critical for accurately deriving most conditional independence
facts from relational models.


7.  Experiments 


To complement the theoretical results, we present three experiments on synthetic data. The
primary goal of these empirical results is to demonstrate the feasibility of applying relational
d-separation in practice. The experiment in Section 7.1 describes the factors that influence
the size of abstract ground graphs and thus the computational complexity of relational
d-separation. The experiment in Section 7.2 evaluates the growth rate of separating sets
produced by relational d-separation as abstract ground graphs become large. The results
indicate that minimal separating sets grow much more slowly than abstract ground graphs.
The experiment in Section 7.3 tests how the expectations of the relational d-separation
theory match statistical conclusions on simulated data. As expected from the proofs of
correctness in Section 5.2, the results indicate a close match, aside from Type I errors and
certain biases of conventional statistical tests on relational data.


7.1  Abstract  Ground  Graph  Size 

Figure 7.1: Variation of abstract ground graph size as (a) the number of MANY cardinalities
in the schema increases (dependencies fixed at 10) and (b) the number of dependencies
increases. Shown with 95% confidence intervals.

Relational d-separation is executed on abstract ground graphs. Consequently, it is
important to quantify the size of abstract ground graphs and identify which factors influence their
size. We randomly generated relational schemas and models for 1,000 trials of each setting
using the following parameters:

•  Number  of  entity  classes,  ranging  from  1  to  4. 

•  Number  of  relationship  classes,  ranging  from  0  to  4.  The  schema  is  guaranteed  to  be 
fully  connected  and  includes  at  most  a  single  relationship  between  a  pair  of  entities. 
Relationship  cardinalities  are  selected  uniformly  at  random. 

•  Number of attributes for each entity and relationship class, randomly drawn from a
shifted Poisson distribution with λ = 1.0 (∼ Pois(1.0) + 1).

•  Number  of  dependencies  in  the  model,  ranging  from  1  to  15. 


This procedure generated a total of 450,000 abstract ground graphs, which included
every perspective (all entity and relationship classes) for each experimental combination.
We measure size as the number of nodes and edges in a given abstract ground graph.
Figure 7.1(a) depicts how the size of abstract ground graphs varies with respect to the
number of MANY cardinalities in the schema (fixed for models with 10 dependencies), and
Figure 7.1(b) shows how it varies with respect to the number of dependencies in the model.
Recall that for a single entity, abstract ground graphs are equivalent to Bayesian networks.

To determine the most influential factors of abstract ground graph size, we ran log-linear
regression using independent variables that describe only the schema and model. Detailed
results are provided in Appendix G. This analysis indicates that (1) as the number of entities,
relationships, attributes, and MANY cardinalities increases, the number of nodes and edges
grows at an exponential rate; (2) as the number of dependencies in the model increases,
the number of edges increases linearly, but the number of nodes remains invariant; and
(3) abstract ground graphs for relationship perspectives are larger than entity perspectives
because more relational variables can be defined.

Figure 7.2: Minimal separating sets have reasonable sizes, growing only with the size of the
schema and the model density. In this experiment, 99.9% of variable pairs have a minimal
separating set with five or fewer variables.


7.2  Minimal  Separating  Set  Size 


Because  abstract  ground  graphs  can  become  large,  one  might  expect  that  separating  sets 
could  also  grow  to  impractical  sizes.  Fortunately,  relational  d-separation  produces  minimal 
separating  sets  that  are  empirically  observed  to  be  small.  We  ran  1,000  trials  of  each  setting 
using  the  following  parameters: 

•  Number  of  entity  classes,  ranging  from  1  to  4. 

•  Number of relationship classes, fixed at one less than the number of entities.
Relationship cardinalities are selected uniformly at random.

•  Total  number  of  attributes  across  entity  and  relationship  classes,  fixed  at  10. 

•  Number  of  dependencies  in  the  model,  ranging  from  1  to  10. 


For each relational model, we identified a single minimal separating set for up to 100
randomly chosen pairs of conditionally independent relational variables. This procedure
generated almost 2.5 million pairs of variables.


To identify a minimal separating set between relational variables X and Y, we modified
Algorithm 4 devised by Tian et al. (1998) by starting with all parents of X and Y, the
variables augmented with the intersection variables they subsume in the abstract ground graph.
While the discovered separating sets are minimal, they are not necessarily of minimum size
because of the greedy process for removing conditioning variables from the separating set.
Figure 7.2 shows the frequency of separating set size as both the number of entities and
dependencies vary. In summary, roughly 83% of the pairs are marginally independent
(having empty separating sets), 13% have separating sets of size one, and less than 0.1%
have separating sets with more than five variables. The experimental results indicate that
separating set size is strongly influenced by model density, primarily because the number
of potential d-connecting paths increases as the number of dependencies increases.
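The greedy search described above can be illustrated on an ordinary directed acyclic graph. The sketch below is not Algorithm 4 of Tian et al. (1998) and omits the augmentation with intersection variables; it is a simplified stand-in that tests d-separation with the moralized ancestral graph and then repeatedly drops any single variable whose removal preserves separation. The graph encoding (a dict from node to its list of parents) and all function names are assumptions.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, x, y, z):
    """Test d-separation of x and y given z via the moral ancestral graph."""
    z = set(z)
    keep = ancestors(parents, {x, y} | z)
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for i, p in enumerate(ps):
            adj[child].add(p)
            adj[p].add(child)
            for q in ps[i + 1:]:  # "marry" parents that share a child
                adj[p].add(q)
                adj[q].add(p)
    queue, seen = deque([x]), {x}
    while queue:  # breadth-first search that never enters z
        n = queue.popleft()
        if n == y:
            return False
        for m in adj[n]:
            if m not in seen and m not in z:
                seen.add(m)
                queue.append(m)
    return True


def greedy_separating_set(parents, x, y):
    """Start from parents(x) | parents(y) and greedily drop variables."""
    sep = (set(parents.get(x, ())) | set(parents.get(y, ()))) - {x, y}
    if not d_separated(parents, x, y, sep):
        return None  # x and y are adjacent, so no separating set exists
    changed = True
    while changed:
        changed = False
        for v in sorted(sep):
            if d_separated(parents, x, y, sep - {v}):
                sep.remove(v)
                changed = True
    return sep
```

For the chain A → B → C with an extra edge A → D, `greedy_separating_set(parents, "C", "D")` starts from {A, B} and returns a set of size one. The result depends on the removal order, which is why such sets need not be of minimum size.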
7.3  Empirical Validity

As a practical demonstration, we examined how the expectations of the relational
d-separation theory match the results of statistical tests on actual data. We use a standard
procedure for empirically measuring internal validity of algorithms. In this case, we (1)
randomly generate a relational schema, (2) randomly generate a relational model structure for
that schema, (3) parameterize the model structure, (4) generate synthetic data according
to the model structure and parameters, (5) randomly choose relational d-separation queries
according to the known ground-truth model, and (6) compare the model theory (i.e., the
d-separation conclusions) against corresponding statistical tests of conditional independence.

For steps (1) and (2), we randomly generated a relational schema S and relational model
structure M for S for 100 trials using the following settings:

•  Number  of  entity  classes,  ranging  from  1  to  4. 

•  Number of relationship classes, fixed at one less than the number of entities.
Relationship cardinalities are selected uniformly at random.

•  Number of attributes for each entity and relationship class, randomly drawn from a
shifted Poisson distribution with λ = 1.0 (∼ Pois(1.0) + 1).

•  Number  of  dependencies  in  the  model,  fixed  at  10. 

Dependencies were selected greedily, choosing each one uniformly at random, subject to a
maximum of 3 parent relational variables for each attribute $[I_j].X$ and enforcing acyclicity
of the model structure.

For step (3), we parameterized relational models using simple additive linear equations
with independent, normally distributed error and the average aggregate for relational
variable instances. For each attribute $[I_j].X$, we assign a conditional probability distribution

\[ \sum_{[I_j, \ldots, I_k].Y \,\in\, \mathit{parents}([I_j].X)} \beta \cdot \mathrm{avg}([I_j, \ldots, I_k].Y) + 0.1\,\epsilon \]

if $[I_j].X$ has parents, where

\[ \beta = \frac{0.9}{|\mathit{parents}([I_j].X)|} \]

to provide equal contribution for each direct cause and $\epsilon \sim N(0, 1)$ (error drawn from a
standard normal distribution). If $[I_j].X$ has no parents, its value is just drawn from $\epsilon$.
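As a concrete reading of this parameterization, the sketch below draws one attribute value from the values reachable through each parent relational variable. The function names and the input format (one list of instance values per parent) are assumptions, and the sketch assumes every parent has at least one reachable instance, since the average of an empty set is not defined here.

```python
import random
from statistics import mean


def structural_part(parent_values):
    """Deterministic part: sum over parents of beta * avg(parent instances).

    parent_values holds one non-empty list of values per parent relational
    variable; beta = 0.9 / (number of parents) weights each parent equally.
    """
    if not parent_values:
        return 0.0
    beta = 0.9 / len(parent_values)
    return sum(beta * mean(values) for values in parent_values)


def sample_attribute(parent_values, rng):
    """Draw one attribute value using standard normal noise."""
    noise = rng.gauss(0.0, 1.0)
    if not parent_values:
        return noise  # no parents: the value is the noise draw itself
    return structural_part(parent_values) + 0.1 * noise
```

For example, with two parents whose reachable values are [1.0, 3.0] and [4.0], each parent gets weight 0.45 and the deterministic part is 0.45 · 2.0 + 0.45 · 4.0 = 2.7.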

For step (4), we first generated a relational skeleton σ (because the current model space
assumes that attributes do not cause entity or relationship existence) and then populated
each attribute value by drawing from its corresponding conditional distribution. Each entity
class is initialized to 1,000 instances. Relationship instances were constructed via a latent
homophily process, similar to the method used by Shalizi and Thomas (2011). Each entity
instance received a single latent variable, marginally independent from all other variables.
The probability of any relationship instance was drawn from

\[ \frac{e^{-\alpha d}}{1 + e^{-\alpha d}}, \]

the inverse logistic function, where $d = |L_{e_1} - L_{e_2}|$ is the difference between the latent
variables on the two entities, and $\alpha = 10$ is the decay parameter. We also scaled
the probabilities in order to produce an expected degree of five for each entity instance
when the cardinality of the relationship is MANY. Since the latent variables are marginally
independent of all others, they are safely omitted from abstract ground graphs; their sole
purpose is to generate relational skeletons that provide a greater probability of non-empty
intersection variables as opposed to a random underlying link structure. We generated
100 independent relational skeletons and attribute values (i.e., 100 instantiated relational
databases) for each schema and model.

Figure 7.3: The proportion of significant trials for statistical tests of conditional
independence on actual data. (Left) Evaluating queries that the model claims to be d-separated
produces low rates of significant effects. (Right) Queries that the model claims are
d-connected produce high rates of significant effects. Note that the generative process yields
denser models for 2 entity classes since the number of dependencies is fixed at 10.
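The latent homophily link probability can be sketched as follows. The inverse logistic form and the decay parameter of 10 come from the text; the scaling step is an assumption, since the text does not say how probabilities were scaled. Here a single global factor is chosen so that the average expected degree on one side equals the target. Function names are hypothetical.

```python
import math
import random


def link_probability(latent_a, latent_b, alpha=10.0):
    """Inverse logistic of the latent distance d = |latent_a - latent_b|."""
    d = abs(latent_a - latent_b)
    return math.exp(-alpha * d) / (1.0 + math.exp(-alpha * d))


def scaled_link_probabilities(latents_a, latents_b,
                              target_degree=5.0, alpha=10.0):
    """Rescale all pairwise probabilities with one global factor (assumed).

    The factor makes the mean expected degree of the first entity class
    equal to target_degree, provided no probability is clipped at 1.
    """
    raw = [[link_probability(a, b, alpha) for b in latents_b]
           for a in latents_a]
    total = sum(sum(row) for row in raw)
    scale = target_degree * len(latents_a) / total
    return [[min(1.0, scale * p) for p in row] for row in raw]
```

Two entities with identical latent values get an unscaled probability of 0.5, and the probability decays toward zero as their latent values move apart.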

Step (5) randomly chooses up to 100 true and false relational d-separation queries for a
given model (see footnote 11). Since we have the ground-truth model, we can evaluate with our
approach (abstract ground graphs and relational d-separation) whether these queries are true
(d-separated) or false (d-connected). Each query is of the form X ⊥⊥ Y | Z such that X
and Y are single relational variables, Z is a set of relational variables, Y has a singleton
relational path (e.g., $[I_k].Y$), and all variables are from a common perspective. These queries
correspond to testing potential direct causal dependencies in the relational model, similar
to the tests used by constraint-based methods for learning relational models, such as RPC
(Maier et al., 2010) and RCD (Maier et al., 2013).


Finally, step (6) tests for conditional independence for all such (X, Y, Z) d-separation
queries using linear regression (because the models were parameterized linearly) for each
of the 100 data instantiations. Specifically, we tested the t-statistic for the coefficient of
avg(X) in the equation $Y = \beta_0 + \beta_1 \cdot \mathrm{avg}(X) + \sum_{Z_i \in \mathbf{Z}} \beta_i \cdot \mathrm{avg}(Z_i)$. For each query, we
recorded two measurements:

•  The average strength of effect, measured as squared partial correlation (the
proportion of remaining variance of Y explained by X after conditioning on Z).

•  The proportion of trials for which each query was deemed significant at α = 0.01,
adjusted using Bonferroni correction with the number of queries per trial.

11.  Depending on the properties of the schema and model, it may not always be feasible to identify 100 true
or false d-separation statements.

Figure 7.4: The average strength of effect of each query (measured as squared partial
correlation) on actual data. (Left) Evaluating queries that the model claims to be
d-separated or conditionally independent produces low average effect sizes. (Right) Queries
that the model claims are d-connected or dependent produce high average effect sizes.
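The two measurements can be sketched with ordinary least squares. The code below is an illustration under stated assumptions, not the authors' implementation: it computes the squared partial correlation as the relative drop in residual sum of squares when avg(X) is added to a regression of Y on the avg(Z_i) terms, and it computes the Bonferroni-adjusted threshold. All inputs are assumed to be already aggregated columns of equal length, and the design matrix is assumed to be non-singular. The significance test itself needs a t-distribution, which is omitted here.

```python
def _solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def _sse(y, columns):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = _solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def squared_partial_correlation(y, x, zs):
    """Share of the variance of y left after zs that x explains."""
    sse_reduced = _sse(y, zs)
    sse_full = _sse(y, [x] + zs)
    return (sse_reduced - sse_full) / sse_reduced


def bonferroni_alpha(alpha, num_queries):
    """Per-query significance threshold after Bonferroni correction."""
    return alpha / num_queries
```

With 100 queries in a trial, the per-query threshold at α = 0.01 becomes 0.0001.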


Figure 7.3 shows the distribution of the proportion of significant trials for both true (left)
and false queries (right) for varying numbers of entities. Figure 7.4 shows the corresponding
average strength of effects for true (left) and false (right) queries. The graph uses a standard
box-and-whisker plot in which values more than 1.5 times the interquartile range (the
difference between the upper and lower quartiles) beyond the quartiles are marked as outliers.
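The outlier convention for the box-and-whisker plots can be written out directly. This is a minimal sketch of the usual 1.5 × IQR rule; the quartile interpolation method is an assumption, since plotting tools differ on it and the text does not specify one.

```python
from statistics import quantiles


def boxplot_outliers(values):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]
```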

In  the  vast  majority  of  cases,  relational  d-separation  is  consistent  with  tests  on  actual 
data (i.e., most d-separated queries have low effect sizes and are rarely deemed significant,
whereas  most  d-connected  queries  have  high  effect  sizes  and  are  mostly  deemed  significant). 
For  approximately  23,000  true  queries,  14.9%  are  significant  in  more  than  one  trial,  but  most 
are  insubstantive,  with  only  2.2%  having  an  average  effect  size  greater  than  0.01.  There 
are  three  potential  reasons  why  a  d-separation  in  theory  may  appear  to  be  d-connected 
in  practice:  (1)  Type  I  error;  (2)  high  power  given  a  large  sample  size;  or  (3)  bias.  We 
have  discovered  that  a  small  number  of  cases  exhibit  an  interaction  between  aggregation 
and  relational  structure  (i.e.,  degree  or  the  cardinality  of  relational  variable  instances). 
This  interaction  violates  the  identically  distributed  assumption  of  data  instances,  which 
produces  a  biased  estimate  of  effect  size  for  simple  linear  regression.  Linear  regression  does 
not  account  for  these  interaction  effects,  suggesting  the  need  for  more  accurate  statistical 
tests  of  conditional  independence  for  relational  data. 


8.  Model  Assumptions  and  Related  Work 

The class of relational models considered in Section 4, while strictly more expressive than
Bayesian networks, has limitations in its current formalization. In this section, we highlight
these assumptions and discuss how related and future work could address them.

Self-relationships: Self-relationships are relationship classes that involve the same entity
class more than once. Relational schemas, as defined in Definition 4.1, can express these
types of relationships. Only the definition of relational paths, which govern the space of
variables and dependencies, requires unique entity class names within [E,R,E] triples (see
condition (2) of Definition 4.3). However, a common procedure in entity-relationship
modeling is to map entity names to unique role indicators within the context of a self-relationship,
such as manager/subordinate, friend1/friend2, or citing-paper/cited-paper (Ramakrishnan
and Gehrke, 2002). This approach does not duplicate entity instances in the skeleton or
ground graph; it only modifies their reference names within the relational path, requiring
extended semantics for terminal sets. Incorporating self-relationships is a straightforward
extension, but for simplicity, we omit this additional layer of complexity.

Relational autocorrelation: In contrast to self-relationships, relational autocorrelation
is a statistical dependency among the values of the same attribute class frequently found
in relational data sets (Jensen and Neville, 2002). Various models and learning algorithms
have been developed to capture these types of dependencies, such as RDNs (Neville and
Jensen, 2007), PBNs with an extended normal form (Schulte et al., 2012), and PRMs with
dependencies that follow guaranteed acyclic relationships (Getoor et al., 2007). Our
formalism, and equivalently PRMs (without guaranteed acyclic relationships), can represent a
class of models for apparent autocorrelation. Any relational dependency that yields a
common cause for grounded variables of the same attribute class (essentially any dependency
that crosses a MANY cardinality) produces relational autocorrelation. The only
autocorrelations not accounted for involve latent causes or those produced by temporal processes
(e.g., feedback).

Context-specific independence: Context-specific independence (CSI) introduces
independence of some variable and its parents, depending on the values of other variables. This can
be achieved within the specification of conditional probability distributions as if-then-else
statements of logical conditions, such as in DAPER models (Heckerman et al., 2007) or
RPMs (Russell and Norvig, 2010), encoded as regularities in conditional probability tables
(Boutilier et al., 1996), or with the recent graphical convention of gates (Minka and Winn,
2009). However, this introduces a notion of independence that cannot be inferred from
model structure via traditional d-separation. In fact, Boutilier et al. (1996) define an
analogous approach based on d-separation of a manipulated Bayesian network through deletion
of vacuous dependencies given some context. Winn (2012) extends the rules of d-separation
to reason over the additional paths and their collective state introduced by gates. An
alternative and more general approach to encoding CSIs is to develop an ontology for which
(in)dependencies hold depending on the type of entity or relationship. PRMs with class
hierarchies allow a hierarchy of entity types where the dependency structure can vary
depending on the type (Getoor et al., 2000). Rules of inheritance derived from object-oriented
programming are used to define a coherent joint probability distribution. This aligns with
our formalism, as relational schemas can be viewed as an ontology defined at a particular
level. However, the semantics of d-separation under inheritance has not been developed
and is a profitable direction of future research.

Causes of entity and relationship existence: Without a generative model of relational
skeletons, the relational models are not truly generative as the skeleton must be generated
prior to the attributes. However, the same issue occurs for Bayesian networks: Relational
skeletons consist of disconnected entity instances, but the model does not specify how many
instances to create. There are relational models that attempt to learn and represent models
with unknown numbers of entity instances, such as BLOG (Milch et al., 2005), or
uncertain relationship instances, such as PRMs with existence uncertainty (Getoor et al., 2002).
However, reasoning about the connection between conditional independence and existence
is an open problem. For relationship existence, selection bias (conditioning) occurs when
testing marginal dependence between variables across a particular relationship (Maier et al.,
2010). For entity existence, some researchers argue that existence cannot be represented
as a variable or predicate (Poole, 2007), while others represent them as predicates (Laskey,
2008). Therefore, we currently choose simple processes for generating skeletons, allowing
us to focus on relational models of attributes and leaving structural causes and effects as
future work.
Causal sufficiency: The relational models we consider assume that all common causes of
observed variables are also observed and included in the model, an assumption commonly
referred to as causal sufficiency. Many researchers have developed methods for learning
and inference by explicitly modeling unobserved variables, typically termed latent variable
models (Bishop, 1999), or inferring the presence of latent entity classes, for example,
latent group models (Neville and Jensen, 2005). However, only ancestral graphs and acyclic
directed mixed graphs (ADMGs) do so in order to preserve an underlying conditional
independence structure (Richardson and Spirtes, 2002; Richardson, 2009). These models
are paired with the theory of m-separation, which is a generalization of d-separation for
Bayesian networks. The generalization of ancestral graphs or ADMGs to relational
models requires extensive theoretical exploration; therefore, we leave this as an important
direction for future work. Given that a primary motivation for d-separation is to support
constraint-based causal discovery, any relational extension to algorithms that learn causal
models without assuming causal sufficiency, such as FCI (Spirtes et al., 1995; Zhang, 2008)
and its variants (Claassen and Heskes, 2011; Colombo et al., 2012), and BCCD (Claassen and
Heskes, 2012), would require such an extension to m-separation.

Temporal and cyclic models: Currently, the relational model is assumed to be acyclic
(with respect to the class dependency graph), and consequently, atemporal. Model-level
cycles typically result from temporal processes for which grounding across time would yield
an acyclic ground graph, such as in dynamic Bayesian networks (Dean and Kanazawa, 1989;
Murphy, 2002). However, cycles can also be due to temporal processes where the
interaction occurs at a faster rate than measurement. As a result, there has been considerable
attention devoted to models that explicitly encode cyclic dependencies, such as the work
by Spirtes (1995), Pearl and Dechter (1996), Richardson (1996), Dash (2005), Schmidt and
Murphy (2009), and Hyttinen et al. (2012). Our formalism currently prohibits any
relational dependency that has a common attribute class for the cause and effect, regardless
of the relational path constraint. Relaxing this assumption would require either explicitly
modeling temporal dynamics or enabling feedback loops. We reserve temporal dynamics
and feedback as another important avenue for future research.

Despite these assumptions, our current work extends the notion of d-separation to a
much more expressive class of models than Bayesian networks. This work is a first step
toward deriving conditional independencies from expressive classes of models. Incorporating
existence, ontologies, temporal dynamics and feedback, and latent variables into our model
is important future work, especially in the context of representing and learning causal
models of realistic domains.


9.  Discussion 

In this paper, we extend the theory of d-separation to graphical models of relational data.
We present the abstract ground graph, a new representation that is sound and complete in
its abstraction of dependencies across all possible ground graphs of a given relational model.
We formally define relational d-separation and offer a sound, complete, and computationally
efficient approach to deriving conditional independence facts from relational models by
exploiting their abstract ground graphs. We also show that relational d-separation is
equivalent to the Markov condition for relational models. We provide an empirical analysis of
relational d-separation on synthetic data, demonstrating a close correspondence between the
theory and statistical results in practice. Finally, we evaluate how frequently the additional
complexity of abstract ground graphs proves necessary for accurately deriving conditional
independence facts.

The results of this paper imply potential flaws in the design and analysis of some real-world
studies. If researchers of social or economic systems choose inappropriate data and
model representations, then their analyses may omit important classes of dependencies.
Specifically, our theory implies that choosing a propositional representation from an
inherently relational domain may lead to serious errors. An abstract ground graph from a
given perspective defines the exact set of variables that must be included in any
propositionalization. The absence of any relational variable (including intersection variables)
may unnecessarily violate causal sufficiency, which could result in the inference of a causal
dependency where conditional independence was not detected. Our work indicates that
researchers should carefully consider how to represent their domains in order to accurately
reason about conditional independence.

The abstract ground graph representation also presents an opportunity to derive new
edge orientation rules for algorithms that learn the structure of relational models, such as
RPC (Maier et al., 2010) and RCD (Maier et al., 2013). There are unique orientations of
edges that are consistent with a given pattern of association that can only be recognized in
an abstract ground graph. For example, in contrast to bivariate IID data, it is simple to
establish the direction of causality for bivariate relational data. Consider the two bivariate,
two-entity relational models depicted in Figure 9.1(a). The first model implies that values
of X on A entities are caused by the values of Y on related B entities. The second model
implies the opposite, that values of Y on B entities are caused by the values of X on related
A entities. For simplicity, we show the relationship class only as a dashed line between
entity classes and omit it from relational paths.

Figure 9.1(b) illustrates a fragment of the abstract ground graph (for hop threshold h = 4)
that each of the two relational models implies. As expected, the directions of the edges in
the two abstract ground graphs are counterposed. Both models produce observable statistical
dependencies for relational variable pairs ([B].Y, [B,A].X) and ([B,A].X, [B,A,B].Y).
However, the relational variables [B].Y and [B,A,B].Y have different observable statistical
dependencies: In the first model, they are marginally independent and conditionally
dependent given [B,A].X, and in the second model, they are marginally dependent and
conditionally independent given [B,A].X. As a result, we can uniquely determine the
direction of causality of the single dependence by exploiting relational structure. (There is
symmetric reasoning for relational variables from A's perspective, and this result is also
applicable to ONE-to-MANY data.)

[Figure 9.1, panel (a): the first model has the dependency [A, B].Y → [A].X; the second
model has the dependency [B, A].X → [B].Y.]

Figure 9.1: (a) Two models of a bivariate relational domain with opposite directions of
causality for a single dependency (relationship class omitted for simplicity); (b) a single
dependency implies additional dependencies among arbitrary relational variables, shown
here in a fragment of the abstract ground graph for B's perspective; (c) an example relational
skeleton; and (d) the ground graphs resulting from applying the relational model to the
skeleton.


To illustrate this fact more concretely, consider the small relational skeleton shown in
Figure 9.1(c) and the ground graphs applied to this skeleton in Figure 9.1(d). In the first
ground graph, we have $y_1 \perp\!\!\!\perp y_2$ and $y_1 \not\perp\!\!\!\perp y_2 \mid x_1$, but in the second ground
graph, we have $y_1 \not\perp\!\!\!\perp y_2$ and $y_1 \perp\!\!\!\perp y_2 \mid x_1$. These opposing conditional
independence relations uniquely determine the correct causal model. In prior work, we
formalized this idea as a new rule, called relational bivariate orientation (RBO) (Maier et al.,
2013), to orient dependencies in a constraint-based causal discovery algorithm.
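The opposing independence patterns in the two ground graphs can be checked mechanically.
The following is a minimal sketch, not the procedure used in the paper: it tests ordinary
d-separation on a small DAG with the standard ancestral moral graph criterion. The function
name `d_separated` and the three-node ground graphs are assumptions made for illustration;
in particular, the skeleton of Figure 9.1(c) is not reproduced here, so we simply assume one
A instance (carrying x1) related to two B instances (carrying y1 and y2).

```python
from itertools import combinations


def d_separated(edges, x, y, given):
    """Return True if x and y are d-separated given `given` in the DAG `edges`.

    Ancestral moral graph criterion: keep only ancestors of {x, y} and the
    conditioning set, connect parents that share a child, drop edge
    directions, delete the conditioning set, and test reachability.
    """
    given = set(given)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        parents.setdefault(u, set())

    # 1. Ancestral closure of x, y, and the conditioning set.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    neighbors = {node: set() for node in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            neighbors[p].add(child)
            neighbors[child].add(p)
        for p, q in combinations(ps, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)

    # 3. Search from x to y without passing through conditioned nodes.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbors[node] - given)
    return True


# Hypothetical ground graphs over one A instance (x1) and two B instances (y1, y2).
model_1 = [("y1", "x1"), ("y2", "x1")]  # first model: Y causes X
model_2 = [("x1", "y1"), ("x1", "y2")]  # second model: X causes Y

for name, graph in (("model 1", model_1), ("model 2", model_2)):
    print(name,
          "marginal:", d_separated(graph, "y1", "y2", []),
          "given x1:", d_separated(graph, "y1", "y2", ["x1"]))
```

Under these assumptions, the first graph treats x1 as a collider and the second treats it as
a common cause, which is the asymmetry that RBO exploits.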


Deriving and formalizing the implications of relational d-separation is a main direction
of future research. Additionally, our experiments suggest that more accurate tests of
conditional independence for relational data need to be developed, specifically tests that can
address the interaction between relational structure and aggregation across terminal sets of
relational variables. This work has also focused solely on relational models of attributes;
future work should consider models of relationship and entity existence to fully characterize
generative models of relational structure. The theory could also be extended to incorporate
functional or deterministic dependencies, as D-separation extends d-separation for Bayesian
networks. Finally, the work on identifying causal effects in Bayesian networks could be
extended to relational models. This may similarly require an extension of do-calculus to
consider the space of relational interventions, which may include adding or removing entity
or relationship instances, as well as fixing attribute values.
Appendix A. Proofs

In this appendix, we provide detailed proofs for all previous lemmas, theorems, and
corollaries.

Figure A.1: Schematic of two relational paths $P_1$ and $P_2$ for which Lemma 4.1 guarantees
that some skeleton $\sigma$ yields a non-empty intersection of their terminal sets. The example
depicts a possible constructed skeleton based on the procedure used in the proof of
Lemma 4.1.


Lemma 4.1 For two relational paths of arbitrary length from $I_j$ to $I_k$ that differ in at least
one item class, $P_1 = [I_j, \ldots, I_m, \ldots, I_k]$ and $P_2 = [I_j, \ldots, I_n, \ldots, I_k]$ with $I_m \neq I_n$, there
exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ such that $P_1|_{i_j} \cap P_2|_{i_j} \neq \emptyset$ for some $i_j \in \sigma(I_j)$.


Proof. Proof by construction. Let $\mathcal{S}$ be an arbitrary schema with two arbitrary relational
paths $P_1 = [I_j, \ldots, I_m, \ldots, I_k]$ and $P_2 = [I_j, \ldots, I_n, \ldots, I_k]$ where $I_m \neq I_n$. We will
construct a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ such that the terminal sets for item $i_j \in \sigma(I_j)$ along $P_1$ and $P_2$
have a non-empty intersection, that is, an item $i_k \in P_1|_{i_j} \cap P_2|_{i_j} \neq \emptyset$ (roughly depicted in
Figure A.1). We use the following procedure to build $\sigma$:
1. Simultaneously traverse $P_1$ and $P_2$ from $I_j$ until the paths diverge. For each entity
class $E \in \mathcal{E}$ reached, add a unique entity instance $e$ to $\sigma(E)$.

2. Simultaneously traverse $P_1$ and $P_2$ backwards from $I_k$ until the paths diverge. For
each entity class $E \in \mathcal{E}$ reached, add a unique entity instance $e$ to $\sigma(E)$.

3. For the divergent subpaths of both $P_1$ and $P_2$, add unique entity instances for each
entity class $E \in \mathcal{E}$.

4. Repeat 1-3 for relationship classes. For each $R \in \mathcal{R}$ reached, add a unique relationship
instance $r$ connecting the entity instances from classes on $P_1$ and $P_2$, and add unique
entity instances for classes $E \in R$ not appearing on $P_1$ and $P_2$.


This process constructs an admissible skeleton: all instances are unique and this process
assumes no cardinality constraints aside from those required by Definition 4.3. By
construction, there exists an item $i_j \in \sigma(I_j)$ such that $P_1|_{i_j} \cap P_2|_{i_j} = \{i_k\} \neq \emptyset$. ■


Figure A.2: Example construction of a relational skeleton for two relational paths
$P_{orig} = [I_1, \ldots, I_m, \ldots, I_j]$ and $P_{ext} = [I_j, \ldots, I_m, \ldots, I_k]$, where item class $I_m$ is repeated
between $I_m$ and $I_j$. This construction is used within the proof of Lemma 5.1.


Lemma 5.1 Let $P_{orig} = [I_1, \ldots, I_j]$ and $P_{ext} = [I_j, \ldots, I_k]$ be two relational paths with
$\mathbf{P} = extend(P_{orig}, P_{ext})$. Then, $\forall P \in \mathbf{P}$ there exists a relational skeleton $\sigma \in \Sigma_{\mathcal{S}}$ such that
$\exists i_1 \in \sigma(I_1)$ such that $\exists i_k \in P|_{i_1}$ and $\exists i_j \in P_{orig}|_{i_1}$ such that $i_k \in P_{ext}|_{i_j}$.


Proof. Let $P \in \mathbf{P}$ be an arbitrary valid relational path, where
$P = P_{orig}^{1,\, n_o - c + 1} + P_{ext}^{c+1,\, n_e}$ for pivot $c$. There are two subcases:

(a) $c = 1$ and $P = [I_1, \ldots, I_j, \ldots, I_k]$. This subcase holds generally for any skeleton.
Proof by contradiction. Let $\sigma$ be an arbitrary skeleton, choose $i_1 \in \sigma(I_1)$ arbitrarily, and
choose $i_k \in P|_{i_1}$ arbitrarily. Assume for contradiction that there is no $i_j$ in the terminal set
$P_{orig}|_{i_1}$ such that $i_k$ would be in the terminal set $P_{ext}|_{i_j}$, that is,
$\forall i_j \in P_{orig}|_{i_1} \; (i_k \notin P_{ext}|_{i_j})$.
Since $P = [I_1, \ldots, I_j, \ldots, I_k]$, we know that $i_k$ is reached by traversing $\sigma$ from $i_1$ via some
$i_j$ to $i_k$. But the traversal from $i_1$ to $i_j$ implies that $i_j \in [I_1, \ldots, I_j]|_{i_1} = P_{orig}|_{i_1}$, and the
traversal from $i_j$ to $i_k$ implies that $i_k \in [I_j, \ldots, I_k]|_{i_j} = P_{ext}|_{i_j}$. Therefore, there must exist
an $i_j \in P_{orig}|_{i_1}$ such that $i_k \in P_{ext}|_{i_j}$.

(b) $c > 1$ and $P = [I_1, \ldots, I_m, \ldots, I_k]$. Proof by construction. We build a relational
skeleton $\sigma$ following the same procedure as outlined in the proof of Lemma 4.1.
Add instances to $\sigma$ for every item class that appears on $P_{orig}$ and $P_{ext}$. Since
$P = [I_1, \ldots, I_m, \ldots, I_k]$, we know that $i_k$ is reached by traversing $\sigma$ from $i_1$ via some $i_m$ to
$i_k$. By case (a), $\exists i_m \in [I_1, \ldots, I_m]|_{i_1}$ such that $i_k \in [I_m, \ldots, I_k]|_{i_m}$. We then must show that
there exists an $i_j \in [I_m, \ldots, I_j]|_{i_m}$ with $i_m \in [I_j, \ldots, I_m]|_{i_j}$. But constructing the skeleton
with unique item instances for every appearance of an item class on the relational paths
provides this and does not violate any cardinality constraints. If any item class appears
more than once, then the bridge burning semantics are induced. However, adding an
additional item instance for every reappearance of an item class enables the traversal from $i_j$ to
$i_m$ and vice versa. An example of this construction is displayed in Figure A.2. This is also a
valid relational skeleton because $P_{orig}$ and $P_{ext}$ are valid relational paths, and by definition,
the cardinality constraints of the schema permit multiple instances in the skeleton of any
repeated item class. By this procedure, we show that there exists a skeleton $\sigma$ such that
there exists an $i_j \in P_{orig}|_{i_1}$ such that $i_k \in P_{ext}|_{i_j}$. ■


Figure A.3: Schematic of the relational paths expected in Lemma 5.2. If item $i_k$ is
unreachable via $extend(P_{orig}, P_{ext})$, then there must exist a $P'_{orig}$ of the form
$[I_1, \ldots, I_m, \ldots, I_n, \ldots, I_j]$.


Lemma 5.2 Let $\sigma \in \Sigma_{\mathcal{S}}$ be a relational skeleton, and let $P_{orig} = [I_1, \ldots, I_j]$ and
$P_{ext} = [I_j, \ldots, I_k]$ be two relational paths with $\mathbf{P} = extend(P_{orig}, P_{ext})$. Then,
$\forall i_1 \in \sigma(I_1) \; \forall i_j \in P_{orig}|_{i_1} \; \forall i_k \in P_{ext}|_{i_j}$, if $\forall P \in \mathbf{P} \; (i_k \notin P|_{i_1})$, then $\exists P'_{orig}$ such that
$P_{orig}|_{i_1} \cap P'_{orig}|_{i_1} \neq \emptyset$ and $i_k \in P'|_{i_1}$ for some $P' \in extend(P'_{orig}, P_{ext})$.


Proof. Proof by construction. Let $i_1 \in \sigma(I_1)$, $i_j \in P_{orig}|_{i_1}$, and $i_k \in P_{ext}|_{i_j}$ be arbitrary
instances such that $i_k \notin P|_{i_1}$ for all $P \in \mathbf{P}$.

Since $i_j \in P_{orig}|_{i_1}$ and $i_k \in P_{ext}|_{i_j}$, but $i_k \notin P|_{i_1}$, there exists no pivot that yields a
common subsequence in $P_{orig}$ and $P_{ext}$ that produces a path in extend that can reach $i_k$.
Let the first divergent item class along the reverse of $P_{orig}$ be $I_l$ and along $P_{ext}$ be $I_n$. The
two paths must not only diverge, but they also necessarily reconverge at least once. If
$P_{orig}$ and $P_{ext}$ do not reconverge, then there are no reoccurrences of an item class along any
$P \in \mathbf{P}$ that would restrict the inclusion of $i_k$ in some terminal set $P|_{i_1}$. The sole reason that
$i_k \notin P|_{i_1}$ for all $P \in \mathbf{P}$ is due to the bridge burning semantics specified in Definition 4.4.

Without loss of generality, assume $P_{orig}$ and $P_{ext}$ reconverge once, at item class $I_m$.
So, $P_{orig} = [I_1, \ldots, I_m, \ldots, I_l, \ldots, I_j]$ and $P_{ext} = [I_j, \ldots, I_n, \ldots, I_m, \ldots, I_k]$ with $I_l \neq I_n$, as
depicted in Figure A.3. Let $P'_{orig} = [I_1, \ldots, I_m, \ldots, I_n, \ldots, I_j]$, which is a valid relational
path because $[I_1, \ldots, I_m]$ is a subpath of $P_{orig}$ and $[I_m, \ldots, I_n, \ldots, I_j]$ is a subpath of $P_{ext}$.
By construction, $i_j \in P_{orig}|_{i_1} \cap P'_{orig}|_{i_1}$. If $P' = [I_1, \ldots, I_m, \ldots, I_k] \in extend(P'_{orig}, P_{ext})$
with pivot at $I_m$, then $i_k \in P'|_{i_1}$. ■
Theorem 5.2 For every acyclic relational model structure $\mathcal{M}$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$,
the abstract ground graph $AGG_{\mathcal{M}B}$ is sound and complete for all ground graphs $GG_{\mathcal{M}\sigma}$ with
skeleton $\sigma \in \Sigma_{\mathcal{S}}$.


Proof. Let $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ be an arbitrary acyclic relational model structure and let $B \in \mathcal{E} \cup \mathcal{R}$
be an arbitrary perspective.

Soundness: To prove that $AGG_{\mathcal{M}B}$ is sound, we must show that for every edge $P_k.X \to P_j.Y$
in $AGG_{\mathcal{M}B}$, there exists a corresponding edge $i_k.X \to i_j.Y$ in the ground graph $GG_{\mathcal{M}\sigma}$
for some skeleton $\sigma \in \Sigma_{\mathcal{S}}$, where $i_k \in P_k|_b$ and $i_j \in P_j|_b$ for some $b \in \sigma(B)$. There are three
subcases, one for each type of edge in an abstract ground graph:

(a) Let $[B, \ldots, I_k].X \to [B, \ldots, I_j].Y \in RVE$ be an arbitrary edge in $AGG_{\mathcal{M}B}$ between a
pair of relational variables. Assume for contradiction that there exists no edge $i_k.X \to i_j.Y$
in any ground graph:

$$\forall \sigma \in \Sigma_{\mathcal{S}} \; \forall b \in \sigma(B) \; \forall i_k \in [B, \ldots, I_k]|_b \; \forall i_j \in [B, \ldots, I_j]|_b \; (i_k.X \to i_j.Y \notin GG_{\mathcal{M}\sigma})$$

By Definition 5.2 for abstract ground graphs, if $[B, \ldots, I_k].X \to [B, \ldots, I_j].Y \in RVE$,
then the model must have dependency $[I_j, \ldots, I_k].X \to [I_j].Y \in \mathcal{D}$ such that
$[B, \ldots, I_k] \in extend([B, \ldots, I_j], [I_j, \ldots, I_k])$. So, by Definition 4.9 for ground graphs, there is an edge
from every $i_k.X$ to every $i_j.Y$, where $i_k$ is in the terminal set for $i_j$ along $[I_j, \ldots, I_k]$:

$$\forall \sigma \in \Sigma_{\mathcal{S}} \; \forall i_j \in \sigma(I_j) \; \forall i_k \in [I_j, \ldots, I_k]|_{i_j} \; (i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma})$$

Since $[B, \ldots, I_k] \in extend([B, \ldots, I_j], [I_j, \ldots, I_k])$, by Lemma 5.1 we know that

$$\exists \sigma \in \Sigma_{\mathcal{S}} \; \exists b \in \sigma(B) \; \exists i_k \in [B, \ldots, I_k]|_b \; \exists i_j \in [B, \ldots, I_j]|_b \; (i_k \in [I_j, \ldots, I_k]|_{i_j})$$

Therefore, there exists a ground graph $GG_{\mathcal{M}\sigma}$ such that $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$, which
contradicts the assumption.

(b) Let $P_1.X \cap P_2.X \to [B, \ldots, I_j].Y \in IVE$ be an arbitrary edge in $AGG_{\mathcal{M}B}$ between
an intersection variable and a relational variable, where $P_1 = [B, \ldots, I_m, \ldots, I_k]$ and
$P_2 = [B, \ldots, I_n, \ldots, I_k]$ with $I_m \neq I_n$. By Lemma 4.1, there exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and
$b \in \sigma(B)$ such that $P_1|_b \cap P_2|_b \neq \emptyset$. Let $i_k \in P_1|_b \cap P_2|_b$ and assume for contradiction that
for all $i_j \in [B, \ldots, I_j]|_b$ there is no edge $i_k.X \to i_j.Y$ in the ground graph $GG_{\mathcal{M}\sigma}$. By
Definition 5.2, if the abstract ground graph has edge $P_1.X \cap P_2.X \to [B, \ldots, I_j].Y \in IVE$,
then either $P_1.X \to [B, \ldots, I_j].Y \in RVE$ or $P_2.X \to [B, \ldots, I_j].Y \in RVE$. Then, as
shown in case (a), there exists an $i_j \in [B, \ldots, I_j]|_b$ such that $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$, which
contradicts the assumption.

(c) Let $[B, \ldots, I_k].X \to P_1.Y \cap P_2.Y \in IVE$ be an arbitrary edge in $AGG_{\mathcal{M}B}$ between
a relational variable and an intersection variable, where $P_1 = [B, \ldots, I_m, \ldots, I_j]$ and
$P_2 = [B, \ldots, I_n, \ldots, I_j]$ with $I_m \neq I_n$. The proof follows case (b) to show that there exists a
skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and $b \in \sigma(B)$ such that for all $i_k \in [B, \ldots, I_k]|_b$ there exists an
$i_j \in P_1 \cap P_2|_b$ such that $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$.

Completeness: To prove that the abstract ground graph $AGG_{\mathcal{M}B}$ is complete, we
show that for every edge $i_k.X \to i_j.Y$ in every ground graph $GG_{\mathcal{M}\sigma}$ where $\sigma \in \Sigma_{\mathcal{S}}$, there
is a set of corresponding edges in $AGG_{\mathcal{M}B}$. Specifically, the edge $i_k.X \to i_j.Y$ yields two
sets of relational variables for some $b \in \sigma(B)$, namely $\mathbf{P}_k.X = \{P_k.X \mid i_k \in P_k|_b\}$ and
$\mathbf{P}_j.Y = \{P_j.Y \mid i_j \in P_j|_b\}$. Note that all relational variables in both $\mathbf{P}_k.X$ and $\mathbf{P}_j.Y$ are
nodes in $AGG_{\mathcal{M}B}$, as are all pairwise intersection variables:
$\forall P_k.X, P'_k.X \in \mathbf{P}_k.X \; (P_k.X \cap P'_k.X \in AGG_{\mathcal{M}B})$ and
$\forall P_j.Y, P'_j.Y \in \mathbf{P}_j.Y \; (P_j.Y \cap P'_j.Y \in AGG_{\mathcal{M}B})$. We show that
for all $P_k.X \in \mathbf{P}_k.X$ and for all $P_j.Y \in \mathbf{P}_j.Y$ either (a) $P_k.X \to P_j.Y \in AGG_{\mathcal{M}B}$,
(b) $P_k.X \cap P'_k.X \to P_j.Y \in AGG_{\mathcal{M}B}$, where $P'_k.X \in \mathbf{P}_k.X$, or
(c) $P_k.X \to P_j.Y \cap P'_j.Y \in AGG_{\mathcal{M}B}$, where $P'_j.Y \in \mathbf{P}_j.Y$.

Let $\sigma \in \Sigma_{\mathcal{S}}$ be an arbitrary skeleton, let $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$ be an arbitrary edge
drawn from $[I_j, \ldots, I_k].X \to [I_j].Y \in \mathcal{D}$, and let $P_k.X \in \mathbf{P}_k.X$, $P_j.Y \in \mathbf{P}_j.Y$ be an arbitrary
pair of relational variables.

(a) If $P_k \in extend(P_j, [I_j, \ldots, I_k])$, then $P_k.X \to P_j.Y \in AGG_{\mathcal{M}B}$ by Definition 5.2.

(b) If $P_k \notin extend(P_j, [I_j, \ldots, I_k])$, but $\exists P'_k \in extend(P_j, [I_j, \ldots, I_k])$ such that
$P'_k.X \in \mathbf{P}_k.X$, then $P'_k.X \to P_j.Y \in AGG_{\mathcal{M}B}$, and $P_k.X \cap P'_k.X \to P_j.Y \in AGG_{\mathcal{M}B}$ by
Definition 5.2.

(c) If $\forall P \in extend(P_j, [I_j, \ldots, I_k]) \; (P.X \notin \mathbf{P}_k.X)$, then by Lemma 5.2, $\exists P'_j$ such that
$i_j \in P'_j|_b$ and $P_k \in extend(P'_j, [I_j, \ldots, I_k])$. Therefore, $P'_j.Y \in \mathbf{P}_j.Y$,
$P_k.X \to P'_j.Y \in AGG_{\mathcal{M}B}$, and $P_k.X \to P'_j.Y \cap P_j.Y \in AGG_{\mathcal{M}B}$ by Definition 5.2. ■


Theorem 5.3 For every acyclic relational model structure $\mathcal{M}$ and perspective $B \in \mathcal{E} \cup \mathcal{R}$,
the abstract ground graph $AGG_{\mathcal{M}B}$ is directed and acyclic.


Proof. Let $\mathcal{M}$ be an arbitrary acyclic relational model structure, and let $B \in \mathcal{E} \cup \mathcal{R}$ be an
arbitrary perspective. It is clear by Definition 5.2 that every edge in the abstract ground
graph $AGG_{\mathcal{M}B}$ is directed by construction. All edges inserted in any abstract ground graph
are drawn from the directed dependencies in a relational model. Since $\mathcal{M}$ is acyclic, the
class dependency graph $G_{\mathcal{M}}$ is also acyclic by Definition 4.10. Assume for contradiction
that there exists a cycle of length $n$ in $AGG_{\mathcal{M}B}$ that contains both relational variables
and intersection variables. By Definition 5.2, all edges inserted in $AGG_{\mathcal{M}B}$ are drawn from
some dependency in $\mathcal{M}$, even for nodes corresponding to intersection variables. Retaining
only the final item class in each relational path for every node in the cycle will yield a cycle
in $G_{\mathcal{M}}$ by Definition 4.10. Therefore, $\mathcal{M}$ could not have been acyclic, which contradicts the
assumption. ■
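The projection used in this proof, keeping only the final item class of each relational path,
can be sketched in a few lines. The encoding below is an assumption made for illustration
rather than the paper's implementation: a dependency $[I_j, \ldots, I_k].X \to [I_j].Y$ is written
as a hypothetical triple `(path, "X", "Y")`, the class-level edge runs from `(I_k, X)` to
`(I_j, Y)`, and acyclicity is tested with Kahn's algorithm.

```python
def class_dependency_graph(dependencies):
    """Project each dependency [I_j, ..., I_k].X -> [I_j].Y onto the
    class-level edge (I_k, X) -> (I_j, Y) by keeping the final item class."""
    return {((path[-1], x), (path[0], y)) for path, x, y in dependencies}


def is_acyclic(edges):
    """Kahn's algorithm: repeatedly remove nodes that have no incoming edge."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    # Every node is removed exactly when the graph has no directed cycle.
    return removed == len(nodes)


# Hypothetical two-entity schema with entity classes A, B and relationship class AB.
acyclic_model = [(("B", "AB", "A"), "X", "Y")]                 # [B, AB, A].X -> [B].Y
cyclic_model = acyclic_model + [(("A", "AB", "B"), "Y", "X")]  # adds [A, AB, B].Y -> [A].X

print(is_acyclic(class_dependency_graph(acyclic_model)))
print(is_acyclic(class_dependency_graph(cyclic_model)))
```

A dependency whose cause and effect share an attribute class, such as
`(("A", "AB", "B", "AB", "A"), "X", "X")`, projects to a self-loop and is therefore rejected by
the same check.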


Appendix  B.  The  Semantics  of  Bridge  Burning 

In  this  appendix,  we  provide  an  example  to  show  that  the  bridge  burning  semantics  for 
terminal  sets  of  relational  paths  yields  a  strictly  more  expressive  class  of  relational  models 
than  semantics  without  bridge  burning.  The  bridge  burning  semantics  produces  terminal 
sets  that  are  necessarily  subsets  of  terminal  sets  which  would  otherwise  be  produced  without 
bridge  burning.  Paradoxically,  this  enables  a  superset  of  relational  models. 

Recall  the  definition  of  a  terminal  set  for  a  relational  path: 


Definition 4.4 (Terminal set) For skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and $i_j \in \sigma(I_j)$, the terminal set $P|_{i_j}$
for relational path $P = [I_j, \ldots, I_k]$ of length $n$ is defined inductively as

$$P^1|_{i_j} = [I_j]|_{i_j} = \{i_j\}$$

$$P^n|_{i_j} = [I_j, \ldots, I_k]|_{i_j} = \bigcup_{i_m \in P^{n-1}|_{i_j}} \Big\{ i_k \;\Big|\; \big((i_m \in i_k \text{ if } I_k \in \mathcal{R}) \lor (i_k \in i_m \text{ if } I_k \in \mathcal{E})\big) \land i_k \notin \bigcup_{l=1}^{n-1} P^l|_{i_j} \Big\}$$


[Figure B.1 panels: (a) Relational model; (b) Relational skeleton; (c) Ground graphs]

Figure B.1: Example demonstrating that bridge burning semantics yields a more expressive
class of models than semantics without bridge burning. (a) Relational model over a
schema with two entity classes and two attributes with two possible relational dependencies
(relationship class omitted for simplicity). (b) Simple relational skeleton with three A and
three B instances. (c) Bridge burning semantics yields three possible ground graphs with
combinations of dependencies (1) and (2), whereas no bridge burning yields two possible
ground graphs. The bridge burning ground graphs subsume the ground graphs without
bridge burning.


The final condition in the inductive definition ($i_k \notin [I_1, \ldots, I_j]|_{i_1}$ for $j = 1$ to $k - 1$)
encodes bridge burning. The item is only added to the terminal set if it is not a member
of the terminal set of any previous subpath. For example, let $P$ be the relational path
[Employee, Develops, Product, Develops, Employee]. This relational path produces
terminal sets that include the employees that work on the same products (that is,
co-workers). Instantiating this path with the employee Quinn, $P|_{Quinn}$, produces the terminal
set {Paul, Roger, Sally}. Since Quinn $\in$ [Employee]$|_{Quinn}$, the bridge burning semantics
excludes Quinn from this set. This makes intuitive sense as well: Quinn should not be
considered her own colleague.
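The inductive definition of terminal sets can be sketched as a level-by-level traversal of the
skeleton. This is an illustrative sketch under stated assumptions, not the authors' code: the
function `terminal_set`, the `adjacent` and `class_of` dictionaries, and the one-product
skeleton (a single hypothetical product, "widget", developed by all four employees) are
invented here so that the Quinn example can be reproduced.

```python
def terminal_set(adjacent, class_of, path, start, bridge_burning=True):
    """Items reached from `start` by traversing `path` (a list of item classes).

    adjacent: item -> set of linked items (entity <-> relationship instance)
    class_of: item -> name of its entity or relationship class
    With bridge burning, an item is dropped if it already appeared in the
    terminal set of any shorter prefix of the path.
    """
    levels = [{start}]
    for item_class in path[1:]:
        visited = set().union(*levels) if bridge_burning else set()
        frontier = {
            nxt
            for cur in levels[-1]
            for nxt in adjacent[cur]
            if class_of[nxt] == item_class and nxt not in visited
        }
        levels.append(frontier)
    return levels[-1]


# Hypothetical skeleton: every employee develops the same product.
class_of, adjacent = {}, {}
for employee in ("Quinn", "Paul", "Roger", "Sally"):
    relationship = ("develops", employee, "widget")  # one Develops instance
    class_of[employee] = "Employee"
    class_of["widget"] = "Product"
    class_of[relationship] = "Develops"
    for item in (employee, "widget"):
        adjacent.setdefault(item, set()).add(relationship)
        adjacent.setdefault(relationship, set()).add(item)

path = ["Employee", "Develops", "Product", "Develops", "Employee"]
print(sorted(terminal_set(adjacent, class_of, path, "Quinn")))
print(sorted(terminal_set(adjacent, class_of, path, "Quinn", bridge_burning=False)))
```

On this assumed skeleton, the bridge burning call yields Paul, Roger, and Sally, while
disabling it also returns Quinn, which illustrates that bridge burning terminal sets are
subsets of those produced without it.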

A  relational  model  is  simply  a  collection  of  relational  dependencies.  Each  relational 
dependency  is  primarily  described  by  the  relational  path  of  the  parent  relational  variable 
(because,  for  canonically  specified  dependencies,  the  relational  path  of  the  child  consists  of 
a  single  item  class).  The  relational  path  specification  is  used  in  the  construction  of  ground 
graphs,  connecting  variable  instances  that  appear  in  the  terminal  sets  of  the  parent  and 
child  relational  variables. 

To characterize the expressiveness of relational models, we can inspect the space of
representable ground graphs by choosing an arbitrary relational skeleton and a small set of
relational dependencies. We show with a simple example that the bridge burning semantics
for a model over a two-entity, bivariate schema yields more possible ground graphs than
without bridge burning. (We omit the relationship class for simplicity.) In Figure B.1(a),
we present such a model with two possible relational dependencies labeled (1) and (2).
Figure B.1(b) provides a simple relational skeleton involving three A and three B instances
(relationship instances are represented as dashed lines for simplicity). As shown in
Figure B.1(c), the bridge burning semantics leads to three possible ground graphs, one for each
combination of the dependencies (1), (2), and both (1) and (2) together. Without bridge
burning, only two ground graphs are possible because dependency (2) completely subsumes
dependency (1) with those semantics.

This  example  generalizes  to  arbitrary  dependencies.  The  terminal  sets  of  relational  paths 
that  repeat  item  classes  subsume  subpaths  under  the  semantics  without  bridge  burning. 
This  leads  to  fewer  possible  relational  models,  which  justifies  our  choice  of  semantics  for 
terminal  sets  of  relational  paths. 


Appendix  C.  Soundness  and  Completeness  of  Relational  Paths 

In  this  appendix,  we  prove  that  the  definition  of  relational  paths  (repeated  below)  is  sound 
and  complete  with  respect  to  producing  non-empty  terminal  sets  for  at  least  one  relational 
skeleton. 


Definition 4.3 (Relational path) A relational path $[I_j, \ldots, I_k]$ for relational schema $\mathcal{S}$
is an alternating sequence of entity and relationship classes $I_j, \ldots, I_k \in \mathcal{E} \cup \mathcal{R}$ such that:

(1) For every pair of consecutive item classes $[E, R]$ or $[R, E]$ in the path, $E \in R$.

(2) For every triple of consecutive item classes $[E, R, E']$, $E \neq E'$.

(3) For every triple of consecutive item classes $[R, E, R']$, if $R = R'$, then $card(R, E) =$
MANY.
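The three conditions translate directly into a validity check. The sketch below is only an
illustration: the function `is_relational_path`, the dictionary-based schema encoding, and
the cardinalities chosen for the Employee/Develops/Product example are assumptions, not
part of the original formalism.

```python
def is_relational_path(path, entities, relationships, card):
    """Check conditions (1)-(3) of Definition 4.3 for a list of class names.

    entities:      set of entity class names
    relationships: relationship class -> tuple of participating entity classes
    card:          (relationship class, entity class) -> "ONE" or "MANY"
    """
    if not path or any(c not in entities and c not in relationships for c in path):
        return False
    for a, b in zip(path, path[1:]):
        # The sequence must alternate between entity and relationship classes.
        if a in entities and b in relationships:
            entity, relationship = a, b
        elif a in relationships and b in entities:
            relationship, entity = a, b
        else:
            return False
        # Condition (1): the entity class participates in the relationship class.
        if entity not in relationships[relationship]:
            return False
    for a, b, c in zip(path, path[1:], path[2:]):
        # Condition (2): [E, R, E'] requires E != E'.
        if b in relationships and a == c:
            return False
        # Condition (3): [R, E, R] requires card(R, E) = MANY.
        if b in entities and a == c and card[(a, b)] != "MANY":
            return False
    return True


# Hypothetical schema; both cardinalities are assumed to be MANY.
entities = {"Employee", "Product"}
relationships = {"Develops": ("Employee", "Product")}
card = {("Develops", "Employee"): "MANY", ("Develops", "Product"): "MANY"}

coworkers = ["Employee", "Develops", "Product", "Develops", "Employee"]
print(is_relational_path(coworkers, entities, relationships, card))
print(is_relational_path(["Employee", "Develops", "Employee"], entities, relationships, card))
```

If the assumed cardinality of Product in Develops were ONE instead, the co-worker path
would violate condition (3), since a product could then appear in only one Develops instance.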


Lemma C.1 Let $\mathcal{S}$ be a relational schema and $[I_j, \ldots, I_k]$ be a sequence of alternating
entity and relationship classes of $\mathcal{S}$ that satisfy participation constraints (condition (1) of
Definition 4.3). The relational path $[I_j, \ldots, I_k]$ satisfies conditions (2) and (3) of Definition
4.3 if and only if there exists a relational skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and an item instance $i_j \in \sigma(I_j)$
such that $[I_j, \ldots, I_k]|_{i_j} \neq \emptyset$. More formally,

$$\exists \sigma \in \Sigma_{\mathcal{S}} \; \exists i_j \in \sigma(I_j) \; \big([I_j, \ldots, I_k]|_{i_j} \neq \emptyset\big) \iff \big([ERE] \notin [I_j, \ldots, I_k]\big) \land \big([RER] \in [I_j, \ldots, I_k] \to card(R, E) = \text{MANY}\big)$$


Proof. Left-to-right ($\Rightarrow$): Assume that there exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and item instance
$i_j \in \sigma(I_j)$ such that $[I_j, \ldots, I_k]|_{i_j} \neq \emptyset$. We must show that $[I_j, \ldots, I_k]$ obeys conditions (2)
and (3), i.e., $[I_j, \ldots, I_k]$ does not contain any $[ERE]$ patterns, and if it contains an $[RER]$
pattern, then $card(R, E) =$ MANY.

• Assume for contradiction that $[I_j, \ldots, I_k]$ contains a pattern of the form $[ERE]$. From
Definition 4.4 for terminal sets, it follows that if the terminal set of a path is not empty,
then the terminal set of every prefix of that path is not empty:

$$[I_j, \ldots, I_k]|_{i_j} \neq \emptyset \Rightarrow [I_j, \ldots, I_m]|_{i_j} \neq \emptyset \text{ for all } [I_j, \ldots, I_m] \leq [I_j, \ldots, I_k]$$

By assumption, $[I_j, \ldots, I_k]|_{i_j} \neq \emptyset$; therefore, the prefix $[I_j, \ldots, I_m]$ that ends in the
$ERE$ pattern also has a non-empty terminal set:

$$[I_j, \ldots, I_k]|_{i_j} \neq \emptyset \Rightarrow [I_j, \ldots, E, R, E]|_{i_j} \neq \emptyset$$
$$[I_j, \ldots, I_k]|_{i_j} \neq \emptyset \Rightarrow [I_j, \ldots, E, R]|_{i_j} \neq \emptyset$$
$$[I_j, \ldots, I_k]|_{i_j} \neq \emptyset \Rightarrow [I_j, \ldots, E]|_{i_j} \neq \emptyset$$

Let $e \in \sigma(E)$ be an entity instance in the terminal set $[I_j, \ldots, E]|_{i_j}$. Since the terminal
set $[I_j, \ldots, E, R]|_{i_j}$ is not empty, it follows that there exists a relationship instance
$r = \langle \ldots, e, \ldots \rangle$ such that $r \in [I_j, \ldots, E, R]|_{i_j}$. However, $[I_j, \ldots, E, R, E]|_{i_j}$ is also
not empty; thus, there exists some $e' \in \sigma(E)$ such that $e' \in [I_j, \ldots, E, R, E]|_{i_j}$, where
$e' \neq e$, and $e' \in r$. It follows that both $e$ and $e'$ participate in the relationship instance
$r$, which is a contradiction.

• Assume for contradiction that $[I_j, \ldots, I_k]$ contains a pattern of the form $[R, E, R]$ and
$card(R, E) =$ ONE.

$$[I_j, \ldots, R]|_{i_j} \neq \emptyset \Rightarrow \exists r = \langle e, \ldots \rangle \in [I_j, \ldots, R]|_{i_j} \quad (1)$$
$$[I_j, \ldots, R, E]|_{i_j} \neq \emptyset \Rightarrow \exists e \in [I_j, \ldots, R, E]|_{i_j} \text{ and } e \in r$$
$$[I_j, \ldots, R, E, R]|_{i_j} \neq \emptyset \Rightarrow \exists r' = \langle e, \ldots \rangle \text{ such that } r' \in [I_j, \ldots, R, E, R]|_{i_j} \text{ and } r' \neq r \text{ (bridge burning semantics)} \quad (2)$$

From (1) and (2) it follows that $e$ participates in two instances of $R$; therefore,
$card(R, E)$ must be MANY, which is a contradiction.


Right-to-left ($\Leftarrow$): Assume that $[I_j, \ldots, I_k]$ adheres to Definition 4.3 for relational paths.
We must show that $\exists \sigma \in \Sigma_{\mathcal{S}} \; \exists i_j \in \sigma(I_j) \; ([I_j, \ldots, I_k]|_{i_j} \neq \emptyset)$. We can construct such a
skeleton $\sigma$ according to the following procedure: For each entity class $E$ on the path, add
a unique entity instance $e$ to $\sigma(E)$. Then, for each relationship class $R$ on the path, add
a unique relationship instance $r$ connecting the previously created unique entity instances
that participate in $R$, and add unique entity instances for classes $E \in R$ not appearing on
the path. This process constructs an admissible skeleton: all instances are unique and this
process assumes no cardinality constraints aside from those required by Definition 4.3. By
construction, there exists an item instance $i_j \in \sigma(I_j)$ such that $[I_j, \ldots, I_k]|_{i_j} \neq \emptyset$. ■
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Appendix  D.  Background  on  Propositional  Data  and  Models 


In  this  appendix,  we  provide  a  brief  review  of  Bayesian  networks,  traditional  d-separation, 
and  their  connection  to  causality.  We  also  describe  why  the  class  of  Bayesian  networks  is  a 
special  case  of  relational  models.  Finally,  we  give  an  example  of  how  to  propositionalize  a 
data  set  drawn  from  a  relational  domain. 

A common assumption in classical statistics, machine learning, and causal discovery is that data instances are independent and identically distributed (IID). The first condition assumes that the variables on any given data instance are marginally independent of the variables of any other data instance. The second condition assumes that every data instance is drawn from the same underlying joint probability distribution. IID data (also referred to as propositional data; see footnote 12) are effectively represented as a single table, where rows correspond to the independent instances and columns are attributes of those instances.

A Bayesian network is a widely used probabilistic graphical model of propositional data (Pearl, 1988). A Bayesian network is represented as a directed acyclic graph $G = (\mathbf{V}, \mathbf{E})$, where $\mathbf{V}$ is a set of vertices corresponding to random variables in the data and $\mathbf{E} \subseteq \mathbf{V} \times \mathbf{V}$ is a set of edges encoding the probabilistic dependencies among the variables. Each random variable $V \in \mathbf{V}$ is associated with a conditional probability distribution $P(V \mid parents(V))$, where $parents(V) \subseteq \mathbf{V} \setminus \{V\}$ is the set of parent variables for V.

If the joint probability distribution $P(\mathbf{V})$ satisfies the Markov condition for G, then $P(\mathbf{V})$ can be factored as $\prod_{V \in \mathbf{V}} P(V \mid parents(V))$ using the conditional distributions. The Markov condition states that every variable $V \in \mathbf{V}$ is conditionally independent of its non-descendants given its parents, where the descendants of V are all variables reachable by a directed path from V. Deriving the set of conditional independencies from G based on the Markov condition is cumbersome, requiring complex combinations of probability axioms. Fortunately, d-separation, a set of graphical criteria, provides the foundation for algorithmic derivation of all conditional independencies in G and entails the exact same set of conditional independencies as the Markov condition (Verma and Pearl, 1988; Geiger and Pearl, 1988; Neapolitan, 2004).
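The factorization and the Markov condition can be checked numerically on a toy network. The following is a minimal sketch, assuming a hypothetical three-variable chain A -> B -> C with made-up binary conditional probabilities (none of these numbers come from the paper). It builds the joint distribution as the product of the conditional distributions and confirms that C is independent of its non-descendant A given its parent B.

```python
from itertools import product

# Hypothetical chain A -> B -> C over binary variables (illustrative only).
parents = {"A": (), "B": ("A",), "C": ("B",)}

# cpd[V][parent_values] = P(V = 1 | parents(V) = parent_values); made-up numbers.
cpd = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.6},
}

def joint(assignment):
    """P(V) computed as the product over V of P(V | parents(V))."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpd[var][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

def prob(event):
    """Probability of a partial assignment, by brute-force summation."""
    names = list(parents)
    total = 0.0
    for values in product((0, 1), repeat=len(names)):
        full = dict(zip(names, values))
        if all(full[k] == v for k, v in event.items()):
            total += joint(full)
    return total

total_mass = prob({})  # the factored joint sums to one
# Markov condition for C: conditioning on A adds nothing once B is known.
p_c_given_ab = prob({"A": 1, "B": 1, "C": 1}) / prob({"A": 1, "B": 1})
p_c_given_b = prob({"B": 1, "C": 1}) / prob({"B": 1})
```

Brute-force summation is exponential in the number of variables, so this is only suitable for illustrating the definitions on very small graphs.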


In  the  following  definition,  a  path  is  a  sequence  of  vertices  following  edges  in  either 
direction.  We  say  that  a  variable  V  is  a  collider  on  a  path  p  if  the  two  arrowheads  point  at 
each  other  (collide)  at  V;  otherwise,  V  is  a  non-collider  on  p. 


Definition D.1 (d-separation) Let $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ be disjoint sets of variables in directed acyclic graph G. A path from some $X \in \mathbf{X}$ to some $Y \in \mathbf{Y}$ is d-connected given $\mathbf{Z}$ if and only if every collider W on the path, or a descendant of W, is a member of $\mathbf{Z}$, and there are no non-colliders in $\mathbf{Z}$. Then, say that $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{Z}$ if and only if there are no d-connecting paths between $\mathbf{X}$ and $\mathbf{Y}$ given $\mathbf{Z}$.


Figure D.1(a) depicts the graphical patterns found along paths that lead to d-separation or d-connection based on Definition D.1, and Figure D.1(b) provides example d-separated and d-connected paths. At first glance, identifying conditional independence facts using the rules of d-separation appears computationally intensive, testing a potentially exponential


12.  IID  data  are  typically  referred  to  as  propositional  because  the  data  can  be  equivalently  expressed  under 
propositional  logic. 
[Figure D.1(a), panel titles: "d-separating path elements (exists one on path)" and "d-connecting path elements (exists all on path)".]

(a) Graphical patterns of d-separating and d-connecting path elements among disjoint sets of variables X and Y given Z. Paths for which there exists a non-collider in Z or a collider not in Z are d-separating. Paths for which all non-colliders are not in Z and all colliders (or a descendant of colliders) are in Z are d-connecting.

[Figure D.1(b), panel titles: "d-separated paths" and "d-connected paths".]

(b) Several example d-separated and d-connected paths that illustrate the composition of path elements.

Figure D.1: Patterns of d-separating and d-connecting path elements and example d-separating and d-connecting paths.


number of paths. However, Geiger et al. (1990) provide a linear-time algorithm based on breadth-first search and reachability on G.
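To make the reachability idea concrete, here is a minimal sketch of a d-separation test that runs a breadth-first search over (node, direction) states. It follows the standard textbook reachability formulation and is not claimed to be the exact procedure of Geiger et al. (1990). The function name `d_separated`, the `parents` dictionary encoding, and the example graph are assumptions made for illustration.

```python
from collections import deque

def d_separated(parents, xs, ys, zs):
    """True if every node in xs is d-separated from every node in ys given zs.

    parents maps each node of a DAG to an iterable of its parents.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    children = {v: set() for v in parents}
    for child, pas in parents.items():
        for p in pas:
            children[p].add(child)

    # Phase 1: Z plus all ancestors of Z. A collider in this set is "open",
    # because it is in Z or has a descendant in Z.
    open_colliders = set()
    stack = list(zs)
    while stack:
        v = stack.pop()
        if v not in open_colliders:
            open_colliders.add(v)
            stack.extend(parents[v])

    # Phase 2: BFS over (node, direction) states.
    # "up" means we arrived from a child, "down" means we arrived from a parent.
    queue = deque((x, "up") for x in xs)
    visited = set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node in ys and node not in zs:
            return False  # found a d-connecting path
        if direction == "up" and node not in zs:
            # Non-collider not in Z: the path may continue in both directions.
            queue.extend((p, "up") for p in parents[node])
            queue.extend((c, "down") for c in children[node])
        elif direction == "down":
            if node not in zs:
                # Chain through a non-collider not in Z.
                queue.extend((c, "down") for c in children[node])
            if node in open_colliders:
                # Collider in Z, or with a descendant in Z: bounce back up.
                queue.extend((p, "up") for p in parents[node])
    return True

# Hypothetical example: A -> C <- B with C -> D.
collider = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
chain = {"A": [], "B": ["A"], "C": ["B"]}
```

Each (node, direction) state is visited at most once, so the search is linear in the size of the graph, which is the property the paragraph above refers to.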

Under a few assumptions, Bayesian networks can be interpreted causally, with edges corresponding to direct causal dependencies. If $X \to Y$ is an edge in the causal model G, then manipulating or changing the value of X will alter the conditional distribution of Y, denoted as $P(Y \mid do(X))$ using Pearl's do-calculus notation for interventions (Pearl, 2000). The causal interpretation of G assumes the causal Markov condition, which is identical to the Markov condition, replacing parents with direct causes and non-descendants with non-effects. In order for the causal Markov condition to hold, the variables $\mathbf{V}$ must also be causally sufficient: There are no latent common causes for any pair of variables in $\mathbf{V}$. The causal Markov condition is also equivalent to d-separation; therefore, both provide the connection between causal structures and probability distributions.
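The difference between conditioning on X and intervening on X can be illustrated with a small numeric example. The sketch below assumes a hypothetical causally sufficient model with edges Z -> X, Z -> Y, and X -> Y and made-up binary probabilities (not taken from the paper). The interventional quantity is computed with the usual truncated factorization, that is, by dropping the factor P(X | Z) from the joint.

```python
# Hypothetical model: Z -> X, Z -> Y, X -> Y, all binary; numbers are made up.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}

def p_x_given_z(x, z):
    """P(X = x | Z = z)."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]

def observational(x):
    """P(Y = 1 | X = x): ordinary conditioning, Z is weighted by P(Z | X = x)."""
    num = sum(p_z[z] * p_x_given_z(x, z) * p_y1_given_xz[(x, z)] for z in (0, 1))
    den = sum(p_z[z] * p_x_given_z(x, z) for z in (0, 1))
    return num / den

def interventional(x):
    """P(Y = 1 | do(X = x)): the factor P(X | Z) is removed, Z keeps its prior."""
    return sum(p_z[z] * p_y1_given_xz[(x, z)] for z in (0, 1))

obs = observational(1)
intv = interventional(1)
```

With these numbers the two quantities differ because Z is a common cause of X and Y; observing X = 1 shifts the distribution of Z, whereas setting X = 1 does not.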

The  conditional  independencies  entailed  by  both  the  causal  Markov  condition  and  d- 
separation  hold  in  all  distributions  that  G  represents.  A  distribution  P  is  faithful  to  G  if 
all  conditional  independencies  in  P  are  entailed  by  the  causal  Markov  condition  on  G.  If 
P  is  assumed  to  be  faithful  to  G,  then  there  are  algorithms  that  can  learn  the  Markov, 
    SELECT t1.id, t1.salary, t2.success, t3.revenue
    FROM   (SELECT E.id, E.salary
            FROM   Employee E) t1,
           (SELECT E.id, P.success
            FROM   Employee E, Develops D, Product P
            WHERE  E.id = D.e_id AND D.p_id = P.id) t2,
           (SELECT E.id, B.revenue
            FROM   Employee E, Develops D, Product P,
                   Funds F, Business-Unit B
            WHERE  E.id = D.e_id AND D.p_id = P.id
              AND  P.id = F.p_id AND F.b_id = B.id) t3
    WHERE  t1.id = t2.id AND t2.id = t3.id

Figure D.2: Sketch of a relational database query that joins the instances of three relational variables having the common perspective Employee, used to produce the data instances shown in Table D.1. The three relational variables are (1) [Employee].Salary, (2) [Employee, Develops, Product].Success, and (3) [Employee, Develops, Product, Funds, Business-Unit].Revenue.


or likelihood, equivalent set of causal models. These algorithms assume causal sufficiency, faithfulness, and model acyclicity to identify the edges in G that are consistent with observed conditional independencies and to determine the direction of causality (Spirtes et al., 2000).


The relational representation presented in Section 4 is strictly more expressive than the propositional representation used in Bayesian network modeling. Propositional representations describe domains with a single entity class; thus, they produce schemas with $|\mathcal{E}| = 1$ (one entity class) and $|\mathcal{R}| = 0$ (no relationship classes). For the organization domain example, consider data about only employees ($\mathcal{E} = \{\text{Employee}\}$). Variables would include intrinsic attributes, such as salary, but could also include variables describing other related entities, all from the employee perspective. This technique of translating a relational database down to a single, propositional representation is often referred to as propositionalization (Kramer et al., 2001). That is, we could construct a single table for employees that includes columns for the success of developed products, the revenue of all business units they work under, etc. In Figure D.2, we show an example SQL-like query that would produce such data, and the resulting data set applied to the example in Figure 2.1(b) is shown in Table D.1 (see footnote 13).
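Propositionalization from the Employee perspective can be sketched with an in-memory SQLite database. The example below is illustrative only: the table layout, the name `BusinessUnit` (used in place of Business-Unit so that it is a valid identifier), and the helper `terminal_set` are assumptions, the Develops links are read off Table D.1, and the Funds links are inferred from that table (the funder of Adapter is not determined by the table and is assumed to be Accessories). Attribute values are omitted because the paper does not list them, so each cell holds the set of related item names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee     (id TEXT PRIMARY KEY);
CREATE TABLE Product      (id TEXT PRIMARY KEY);
CREATE TABLE BusinessUnit (id TEXT PRIMARY KEY);
CREATE TABLE Develops     (e_id TEXT, p_id TEXT);
CREATE TABLE Funds        (b_id TEXT, p_id TEXT);
""")

# Relationship instances consistent with Table D.1 (Funds links partly assumed).
develops = [("Paul", "Case"), ("Quinn", "Case"), ("Quinn", "Adapter"),
            ("Quinn", "Laptop"), ("Roger", "Laptop"), ("Sally", "Laptop"),
            ("Sally", "Tablet"), ("Thomas", "Tablet"), ("Thomas", "Smartphone")]
funds = [("Accessories", "Case"), ("Accessories", "Adapter"),
         ("Devices", "Laptop"), ("Devices", "Tablet"), ("Devices", "Smartphone")]

conn.executemany("INSERT INTO Employee VALUES (?)",
                 [(e,) for e in sorted({e for e, _ in develops})])
conn.executemany("INSERT INTO Product VALUES (?)",
                 [(p,) for p in sorted({p for _, p in develops})])
conn.executemany("INSERT INTO BusinessUnit VALUES (?)",
                 [(b,) for b in sorted({b for b, _ in funds})])
conn.executemany("INSERT INTO Develops VALUES (?, ?)", develops)
conn.executemany("INSERT INTO Funds VALUES (?, ?)", funds)

# Items reached by [Employee, Develops, Product] from one employee.
PRODUCTS_SQL = """
SELECT DISTINCT P.id
FROM Employee E, Develops D, Product P
WHERE E.id = D.e_id AND D.p_id = P.id AND E.id = ?
"""
# Items reached by [Employee, Develops, Product, Funds, Business-Unit].
UNITS_SQL = """
SELECT DISTINCT B.id
FROM Employee E, Develops D, Product P, Funds F, BusinessUnit B
WHERE E.id = D.e_id AND D.p_id = P.id
  AND P.id = F.p_id AND F.b_id = B.id AND E.id = ?
"""

def terminal_set(sql, employee):
    """Set of item ids returned by a one-column query for a given employee."""
    return {row[0] for row in conn.execute(sql, (employee,)).fetchall()}

employees = [r[0] for r in conn.execute("SELECT id FROM Employee ORDER BY id").fetchall()]
rows = {e: {"Success": terminal_set(PRODUCTS_SQL, e),
            "Revenue": terminal_set(UNITS_SQL, e)} for e in employees}
```

Collecting one set per employee and per relational variable mirrors the set-valued cells of Table D.1; a single flat join would instead repeat an employee once for every combination of related products and business units.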


The  relational  skeleton  of  a  Bayesian  network  consists  of  a  set  of  disconnected  entity 
instances,  all  drawn  from  the  same  entity  class.  Consequently,  the  skeleton  has  a  simple 
one-to-one  mapping  with  the  representation  as  a  table:  Each  entity  instance  corresponds 


13. Note that modeling propositionalized data with Bayesian networks still requires the IID assumption, which is often violated since variables of one instance can influence variables of another. For example, according to the model in Figure 2., the competence of collaborating employees influences the success of products, which affects the revenue of business units, which affects its budget, thereby influencing an employee's salary. As a result, modeling relational data with a propositional representation may unnecessarily lose valuable information, especially in the context of causal reasoning and accurate estimation of causal effects.
Employee | [Employee].Salary | [Employee, Develops, Product].Success            | [Employee, Develops, Product, Funds, Business-Unit].Revenue
Paul     | {Paul.Salary}     | {Case.Success}                                   | {Accessories.Revenue}
Quinn    | {Quinn.Salary}    | {Case.Success, Adapter.Success, Laptop.Success}  | {Accessories.Revenue, Devices.Revenue}
Roger    | {Roger.Salary}    | {Laptop.Success}                                 | {Devices.Revenue}
Sally    | {Sally.Salary}    | {Laptop.Success, Tablet.Success}                 | {Devices.Revenue}
Thomas   | {Thomas.Salary}   | {Tablet.Success, Smartphone.Success}             | {Devices.Revenue}

Table  D.l:  Propositional  table  consisting  of  employees,  their  salary,  the  success  of  products 
they  develop,  and  the  revenue  of  the  business  units  they  operate  under.  Producing  this  table 
requires  joining  the  instances  of  three  relational  variables,  all  from  a  common  perspective — 
Employee. 

to  a  single  row,  and  each  variable  is  a  column.  In  this  example,  each  employee  would  be  an 
entity  instance,  and  no  instances  of  other  entity  types  or  relationships  would  appear  in  the 
skeleton.  Because  all  variables  in  a  Bayesian  network  are  defined  for  a  single  entity  class  and 
no  relationships,  the  relational  path  specification  becomes  trivial  and,  hence,  implicit.  All 
relational  paths,  relational  variables,  and  relational  dependencies  are  defined  from  a  single 
perspective  with  singleton  paths  (e.g.,  [Employee]).  The  ground  graph  of  a  Bayesian 
network,  similar  to  the  skeleton,  has  a  very  regular  structure.  The  ground  graph  consists 
of  a  set  of  identical  copies  of  the  model  structure,  one  for  each  instance  in  the  skeleton.  For 
a  Bayesian  network,  d-separation  can  be  applied  directly  to  the  model  structure  because 
there  is  no  variability  in  its  ground  graphs. 
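The regular structure of a Bayesian network's ground graph can be sketched in a few lines. This is an illustrative sketch only: the function `ground_graph`, the variable names in `model_edges`, and the instance names are hypothetical. Each skeleton instance receives its own copy of the model structure, and no edge crosses from one instance to another.

```python
def ground_graph(model_edges, instances):
    """Ground graph of a propositional model: one copy of the structure per instance."""
    return {((i, u), (i, v)) for i in instances for (u, v) in model_edges}

# Hypothetical single-entity model over Employee attributes (illustrative names).
model_edges = [("Competence", "Salary"), ("Tenure", "Salary")]
instances = ["Paul", "Quinn", "Roger"]

gg = ground_graph(model_edges, instances)
```

Because every copy is identical and disconnected from the others, any d-separation statement that holds in the model structure holds in each copy, which is why the model structure alone suffices in the propositional case.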

Appendix  E.  Hop  Thresholds 

For  practical  implementations,  the  size  of  the  abstract  ground  graphs  should  be  limited  by 
a  domain-specific  threshold.  In  this  work,  we  choose  to  apply  a  singular  hop  threshold  to 
the  relational  paths  that  are  represented  in  an  abstract  ground  graph.  In  this  appendix,  we 
examine  the  effect  of  choosing  a  particular  hop  threshold. 

First, we introduce the notion of (B, h)-reachability, which describes the conditions under which an edge in a ground graph is represented in an abstract ground graph.

Definition E.1 ((B, h)-reachability) Let $GG_{\mathcal{M}\sigma}$ be the ground graph for some relational model structure $\mathcal{M}$ and skeleton $\sigma \in \Sigma_{\mathcal{S}}$. Then, $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$ is (B, h)-reachable for perspective B and hop threshold h if there exist relational variables $P_k.X = [B, \ldots, I_k].X$ and $P_j.Y = [B, \ldots, I_j].Y$ such that $\mathrm{length}(P_k) \le h + 1$, $\mathrm{length}(P_j) \le h + 1$, and there exists an instance $b \in \sigma(B)$ with $i_k \in P_k|_b$ and $i_j \in P_j|_b$.

In other words, the edge $i_k.X \to i_j.Y$ in the ground graph is (B, h)-reachable if an instance of the base item $b \in \sigma(B)$ can reach $i_k$ and $i_j$ in at most h hops.
Since Definition E.1 pertains to edges reachable via a particular perspective B and hop threshold h, it relates to the reachability of edges in abstract ground graphs. We denote abstract ground graphs for perspective B, limited by a hop threshold h, as $AGG_{\mathcal{M}Bh}$. Definition E.1 implies that (1) for every edge in ground graph $GG_{\mathcal{M}\sigma}$, we can derive a set of abstract ground graphs for which that edge is (B, h)-reachable, and (2) for every abstract ground graph $AGG_{\mathcal{M}Bh}$, we can derive the set of (B, h)-reachable edges for a given ground graph. Given (B, h)-reachability, we can now express the soundness and completeness of abstract ground graphs.


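Theorem E.1 For every acyclic relational model structure $\mathcal{M}$, perspective $B \in \mathcal{E} \cup \mathcal{R}$, and hop threshold $h_a \in \mathbb{N}^0$, the abstract ground graph $AGG_{\mathcal{M}Bh_a}$ is sound up to hop threshold $h_a$ for all ground graphs $GG_{\mathcal{M}\sigma}$ with skeleton $\sigma \in \Sigma_{\mathcal{S}}$.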


Proof. Soundness means that for every edge $[B, \ldots, I_j].X \to [B, \ldots, I_k].Y$ in the abstract ground graph $AGG_{\mathcal{M}Bh_a}$, there exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$, a base item instance $b \in \sigma(B)$, an instance $i_j \in [B, \ldots, I_j]|_b$, and an instance $i_k \in [B, \ldots, I_k]|_b$ such that $i_j.X \to i_k.Y$ is a $(B, h_a)$-reachable edge in $GG_{\mathcal{M}\sigma}$. The proof is identical to the proof of soundness for Theorem 5.2 (see Appendix A). ■


Theorem E.2 For every acyclic relational model structure $\mathcal{M}$, perspective $B \in \mathcal{E} \cup \mathcal{R}$, and hop threshold $h_r \in \mathbb{N}^0$, the abstract ground graph $AGG_{\mathcal{M}Bh_a}$ is complete up to hop threshold $h_r$ for all ground graphs $GG_{\mathcal{M}\sigma}$ with skeleton $\sigma \in \Sigma_{\mathcal{S}}$, where $h_a = \max(h_r + h_m, h_r + 2h_m - 2)$ and $h_m$ is the maximum number of hops for a dependency in $\mathcal{M}$.


Proof. Let $\mathcal{M} = (\mathcal{S}, \mathcal{D})$ be an arbitrary acyclic relational model structure, let $B \in \mathcal{E} \cup \mathcal{R}$ be an arbitrary perspective, and let $h_r \in \mathbb{N}^0$ be an arbitrary hop threshold.

To prove that the abstract ground graph $AGG_{\mathcal{M}Bh_a}$ is complete up to hop threshold $h_r$, we show that for every $(B, h_r)$-reachable edge $i_k.X \to i_j.Y$ in every ground graph $GG_{\mathcal{M}\sigma}$ with $\sigma \in \Sigma_{\mathcal{S}}$, there is a set of corresponding edges in $AGG_{\mathcal{M}Bh_a}$. Specifically, the $(B, h_r)$-reachable edge $i_k.X \to i_j.Y$ yields two sets of relational variables for some $b \in \sigma(B)$, namely $\mathbf{P_k.X} = \{P_k.X \mid i_k \in P_k|_b \wedge \mathrm{length}(P_k) \le h_r + 1\}$ and $\mathbf{P_j.Y} = \{P_j.Y \mid i_j \in P_j|_b \wedge \mathrm{length}(P_j) \le h_r + 1\}$ by Definition E.1. Note that all relational variables in both $\mathbf{P_k.X}$ and $\mathbf{P_j.Y}$ are nodes in $AGG_{\mathcal{M}Bh_a}$. We show that for all $P_k.X \in \mathbf{P_k.X}$ and for all $P_j.Y \in \mathbf{P_j.Y}$ either (a) $P_k.X \to P_j.Y \in AGG_{\mathcal{M}Bh_a}$, (b) $P_k.X \cap P'_k.X \to P_j.Y \in AGG_{\mathcal{M}Bh_a}$ or $P_k.X \cap P'_k.X \to P'_j.Y \in AGG_{\mathcal{M}Bh_a}$, where $i_k \in P'_k|_b$ and $i_j \in P'_j|_b$, or (c) $P'_k.X \to P_j.Y \cap P'_j.Y \in AGG_{\mathcal{M}Bh_a}$ or $P_k.X \to P_j.Y \cap P'_j.Y \in AGG_{\mathcal{M}Bh_a}$, where $i_k \in P'_k|_b$ and $i_j \in P'_j|_b$.

Let $\sigma \in \Sigma_{\mathcal{S}}$ be an arbitrary skeleton, let $i_k.X \to i_j.Y \in GG_{\mathcal{M}\sigma}$ be an arbitrary $(B, h_r)$-reachable edge drawn from $[I_j, \ldots, I_k].X \to [I_j].Y \in \mathcal{D}$ where $\mathrm{length}([I_j, \ldots, I_k]) \le h_m + 1$, and let $P_k.X \in \mathbf{P_k.X}$, $P_j.Y \in \mathbf{P_j.Y}$ be an arbitrary pair of relational variables. There are three cases:

(a) $P_k \in extend(P_j, [I_j, \ldots, I_k])$. Then, $\mathrm{length}(P_k) \le (h_r + 1) + (h_m + 1) - 1 = h_r + h_m + 1 \le h_a + 1$. Therefore, $P_k.X$ is a node in the abstract ground graph, and $P_k.X \to P_j.Y \in AGG_{\mathcal{M}Bh_a}$ by Definition 5.2.

(b) $P_k \notin extend(P_j, [I_j, \ldots, I_k])$, but $\exists P'_k \in extend(P_j, [I_j, \ldots, I_k])$ such that $i_k \in P'_k|_b$. Then, $\mathrm{length}(P'_k) \le (h_r + 1) + (h_m + 1) - 1 = h_r + h_m + 1 \le h_a + 1$. Therefore, $P'_k$ is a node in the abstract ground graph, $P'_k.X \to P_j.Y \in AGG_{\mathcal{M}Bh_a}$, and $P_k.X \cap P'_k.X \to P_j.Y \in AGG_{\mathcal{M}Bh_a}$ by Definition 5.2.

(c) For all $P'_k \in extend(P_j, [I_j, \ldots, I_k])$, it is the case that $i_k \notin P'_k|_b$. Then by Lemma 5.2 there exists a $P'_j$ such that $i_j \in P'_j|_b$ and there exists a $P'_k \in extend(P'_j, [I_j, \ldots, I_k])$. Given the way $P'_j$ is constructed, its length is bounded by:

$\mathrm{length}(P'_j) \le \mathrm{length}(P_j) + \mathrm{length}([I_j, \ldots, I_k]) - 3 \le (h_r + 1) + (h_m + 1) - 3 = h_r + h_m - 1$

$P'_k.X$ intersects with $P_k.X$ since they both reach $i_k$, and the length of $P'_k$ is bounded by:

$\mathrm{length}(P'_k) \le \mathrm{length}(P'_j) + \mathrm{length}([I_j, \ldots, I_k]) - 1 \le (h_r + h_m - 1) + (h_m + 1) - 1 = h_r + 2h_m - 1$

Also by Lemma 5.2, we know that $P_j$ and $P'_j$ intersect. Since $\mathrm{length}(P'_k) \le h_r + 2h_m - 1 \le h_a + 1$, $P'_k$ is a node in the abstract ground graph, $P'_k.X \to P'_j.Y \in AGG_{\mathcal{M}Bh_a}$, $P'_k.X \to P'_j.Y \cap P_j.Y \in AGG_{\mathcal{M}Bh_a}$, and $P_k.X \cap P'_k.X \to P'_j.Y \in AGG_{\mathcal{M}Bh_a}$ by Definition 5.2.

From the above three cases, it follows that to guarantee completeness up to $h_r$, the abstract ground graph must contain nodes up to the hop threshold $h_a = \max(h_r + h_m, h_r + 2h_m - 2)$. ■


Theorems  E.l  and  E.2  guarantee  that  if  an  abstract  ground  graph  is  constructed  with 
a  hop  threshold  of  ha  from  perspective  B,  it  captures  all  paths  of  dependence  in  all  ground 
graphs,  where  (1)  the  variables  along  those  paths  are  reachable  in  hr  hops  from  instances 
of  B  and  (2)  the  underlying  dependencies  are  bounded  by  a  threshold  of  hm. 
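The threshold relationship in Theorem E.2 is simple arithmetic, and a small helper makes the two regimes visible. The function name `agg_hop_threshold` is hypothetical; it only evaluates h_a = max(h_r + h_m, h_r + 2h_m - 2) for given h_r and h_m.

```python
def agg_hop_threshold(h_r: int, h_m: int) -> int:
    """Hop threshold h_a needed for completeness up to h_r (Theorem E.2).

    h_r: maximum number of hops of the relational variables in the query.
    h_m: maximum number of hops of any dependency in the model.
    """
    if h_r < 0 or h_m < 0:
        raise ValueError("hop counts must be non-negative")
    return max(h_r + h_m, h_r + 2 * h_m - 2)

# The first term dominates when h_m <= 2, the second when h_m > 2.
thresholds = {(h_r, h_m): agg_hop_threshold(h_r, h_m)
              for h_r in (0, 2, 3) for h_m in (0, 1, 2, 4)}
```

For example, with h_r = 2 and h_m = 4 the abstract ground graph must be built out to h_a = 8 hops, since the second term (2 + 8 - 2) exceeds the first (2 + 4).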

In  the  following,  we  say  that  d-separation  holds  up  to  a  specified  hop  threshold  h  if 
there  are  no  d-connecting  paths  involving  relational  variables  of  length  greater  than  h  +  1. 


Theorem E.3 Relational d-separation is sound and complete for abstract ground graphs up to a specified hop threshold. Let $\mathcal{M}$ be an acyclic relational model structure, and let $h_m$ be the maximum number of hops for a dependency in $\mathcal{M}$. Let $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ be three distinct sets of relational variables for perspective $B \in \mathcal{E} \cup \mathcal{R}$ defined over relational schema $\mathcal{S}$, and let $h_r$ be the maximum number of hops of relational variables in $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. Then, $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{Z}$ on the abstract ground graph $AGG_{\mathcal{M}Bh_a}$ if and only if for all skeletons $\sigma \in \Sigma_{\mathcal{S}}$ and for all $b \in \sigma(B)$, $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$ up to hop threshold $h_r$ in ground graph $GG_{\mathcal{M}\sigma}$, where $h_a = \max(h_r + h_m, h_r + 2h_m - 2)$.


Proof. We must show that d-separation on an abstract ground graph implies d-separation on all ground graphs it represents (soundness) and that d-separation facts that hold across all ground graphs are also entailed by d-separation on the abstract ground graph (completeness).

Soundness: Assume that $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{Z}$ on $AGG_{\mathcal{M}Bh_a}$. Assume for contradiction that there exists a skeleton $\sigma \in \Sigma_{\mathcal{S}}$ and an item instance $b \in \sigma(B)$ such that $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are not d-separated by $\mathbf{Z}|_b$ in the ground graph $GG_{\mathcal{M}\sigma}$. Then, there must exist a d-connecting path p from some $x \in \mathbf{X}|_b$ to some $y \in \mathbf{Y}|_b$ given all $z \in \mathbf{Z}|_b$ such that every edge of p is $(B, h_r)$-reachable. By Theorem E.2, $AGG_{\mathcal{M}Bh_a}$ is $(B, h_r)$-reachably complete, so all $(B, h_r)$-reachable edges in $GG_{\mathcal{M}\sigma}$ are captured by edges in $AGG_{\mathcal{M}Bh_a}$. Thus, path p must be represented from some node in $\{N_x \mid x \in N_x|_b\}$ to some node in $\{N_y \mid y \in N_y|_b\}$, where $N_x, N_y$ are nodes in $AGG_{\mathcal{M}Bh_a}$. If p is d-connecting in $GG_{\mathcal{M}\sigma}$,




Predictor                                      Coefficient   Partial   Semipartial
log(# dependencies) × # entities                      1.38     0.232         0.085
log(# dependencies)                                   1.14     0.135         0.044
log(# dependencies) × # MANY cardinalities           -0.71     0.092         0.028
# entities × # relational variables                  -0.32     0.044         0.013

Table F.1: Number of equivalent conditional independence judgments: estimated standardized coefficient, squared partial correlation coefficient, and squared semipartial correlation coefficient for each predictor.


then it is d-connecting in $AGG_{\mathcal{M}Bh_a}$, which implies that $\mathbf{X}$ and $\mathbf{Y}$ are not d-separated by $\mathbf{Z}$. Therefore, $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ must be d-separated by $\mathbf{Z}|_b$.

Completeness: Assume that $\mathbf{X}|_b$ and $\mathbf{Y}|_b$ are d-separated by $\mathbf{Z}|_b$ in the ground graph $GG_{\mathcal{M}\sigma}$ for all skeletons $\sigma \in \Sigma_{\mathcal{S}}$ and for all $b \in \sigma(B)$. Assume for contradiction that $\mathbf{X}$ and $\mathbf{Y}$ are not d-separated by $\mathbf{Z}$ on $AGG_{\mathcal{M}Bh_a}$. Then, there must exist a d-connecting path p for some relational variable $X \in \mathbf{X}$ to some $Y \in \mathbf{Y}$ given all $Z \in \mathbf{Z}$. By Theorem E.1, $AGG_{\mathcal{M}Bh_a}$ is $(B, h_a)$-reachably sound, so every edge in $AGG_{\mathcal{M}Bh_a}$ must correspond to some pair of variables in some ground graph. Thus, if p is d-connecting in $AGG_{\mathcal{M}Bh_a}$, then there must exist some skeleton $\sigma$ such that p is d-connecting in $GG_{\mathcal{M}\sigma}$ for some $b \in \sigma(B)$, which implies that d-separation does not hold for that ground graph. Therefore, $\mathbf{X}$ and $\mathbf{Y}$ must be d-separated by $\mathbf{Z}$ on $AGG_{\mathcal{M}Bh_a}$. ■


Appendix F. Experimental Details: Equivalence of a Naive Approach


In this appendix, we provide additional details for the experiment in Section 6. The main
goal  of  this  experiment  is  to  quantify  how  often  traditional  d-separation  applied  directly  to 
relational  model  structures  produces  incorrect  conditional  independence  facts.  This  provides 
a  rough  measurement  for  the  additional  representational  power  of  relational  d-separation 
on  abstract  ground  graphs.  Here,  we  present  an  analysis  of  which  factors  influence  the 
number  of  equivalent  and  non-equivalent  conditional  independence  judgments  between  both 
approaches  (naively  applying  traditional  d-separation  versus  relational  d-separation). 

Specifically, we show here the results of running log-linear regression to predict the number of equivalent and non-equivalent judgments for varying schemas and models. We first applied lasso for feature selection (Tibshirani, 1996) to minimize the number of predictors while maximizing model fit. We also standardized the input variables by dividing by two standard deviations, as recommended by Gelman (2008). Since the predictor for the number of dependencies is log-transformed, the standardization for that variable occurs after taking the logarithm.
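The standardization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `standardize` and the sample values are hypothetical, and centering on the mean before dividing by two standard deviations is an assumption based on the usual form of Gelman's recommendation (the text above only mentions the division). The log transform is applied first, as described for the number of dependencies.

```python
import math
import statistics

def standardize(values, log_transform=False):
    """Center a predictor and divide by two (sample) standard deviations.

    If log_transform is True, the logarithm is taken before standardizing.
    Centering is an assumption here; the source only states the division.
    """
    if log_transform:
        values = [math.log(v) for v in values]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / (2.0 * sd) for v in values]

# Hypothetical raw predictor values (illustrative only).
num_dependencies = [1, 2, 4, 8, 16]
num_entities = [1, 2, 3, 4]

z_log_deps = standardize(num_dependencies, log_transform=True)
z_entities = standardize(num_entities)
```

After this transformation each predictor has a standard deviation of 0.5, which puts continuous inputs on a scale roughly comparable to a balanced binary indicator and makes the standardized coefficients in the tables easier to compare.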

In  predicting  the  (log  of  the)  number  of  equivalent  conditional  independencies,  the 
following  variables  were  significantly  and  substantively  predictive  (in  order  of  decreasing 
predictive  power): 


•  Interaction  between  the  log  of  the  number  of  dependencies  and  the  number  of  entities 
(positive) 

•  Log  of  the  number  of  dependencies  (positive) 
Predictor                                      Coefficient   Partial   Semipartial
# MANY cardinalities × # entities                    -2.22     0.207         0.064
log(# dependencies) × # entities                      0.90     0.165         0.048
# MANY cardinalities                                  3.24     0.128         0.036
log(# dependencies) × # MANY cardinalities            1.47     0.127         0.036

Table F.2: Number of non-equivalent conditional independence judgments: estimated standardized coefficient, squared partial correlation coefficient, and squared semipartial correlation coefficient for each predictor.


•  Interaction between the log of the number of dependencies and the number of MANY cardinalities (negative)

•  Number  of  entities  (negative) 

•  Interaction  between  the  number  of  entities  and  the  number  of  relational  variables  in 
the  AGG  (negative) 


The fit for the equivalent model has an $R^2 = 0.721$ for n = 4,000, and Table F.1 contains the standardized coefficients as well as the squared partial and semipartial correlation coefficients for each predictor. For lasso, $\lambda = 0.0076$ offered the fewest predictors while increasing the model fit by at least 0.01.
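The squared partial and semipartial correlations reported for a predictor are linked through the model's overall fit by standard regression identities: the squared semipartial equals the drop in R^2 when the predictor is removed, and the squared partial divides that drop by the variance left unexplained without the predictor. The sketch below (the function name `squared_partial` is hypothetical) applies these identities to the Table F.1 values as a rough consistency check; agreement is expected only up to rounding of the published figures.

```python
def squared_partial(r2_full, squared_semipartial):
    """Squared partial correlation implied by R^2 and a squared semipartial.

    sr^2 = R^2_full - R^2_reduced, and pr^2 = sr^2 / (1 - R^2_reduced).
    """
    r2_reduced = r2_full - squared_semipartial
    return squared_semipartial / (1.0 - r2_reduced)

R2_EQUIVALENT = 0.721  # fit reported for the equivalent model

# (reported squared partial, reported squared semipartial) from Table F.1.
table_f1 = {
    "log(# dependencies) x # entities": (0.232, 0.085),
    "log(# dependencies)": (0.135, 0.044),
    "log(# dependencies) x # MANY cardinalities": (0.092, 0.028),
    "# entities x # relational variables": (0.044, 0.013),
}

implied = {name: squared_partial(R2_EQUIVALENT, sr2)
           for name, (_, sr2) in table_f1.items()}
```

The implied values land within a few thousandths of the reported squared partial correlations, which is what one would expect given three-decimal rounding.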

In  predicting  the  (log  of  the)  number  of  non-equivalent  conditional  independencies,  the 
following  variables  were  significantly  and  substantively  predictive  (in  order  of  decreasing 
predictive  power): 


•  Interaction  between  the  number  of  MANY  cardinalities  and  the  number  of  entities 
(negative) 

•  Interaction  between  the  log  of  the  number  of  dependencies  and  the  number  of  entities 
(positive) 

•  Number  of  MANY  cardinalities  (positive) 

•  Interaction between the log of the number of dependencies and the number of MANY cardinalities (positive)


The fit for the non-equivalent model has an $R^2 = 0.755$ for n = 4,000, and Table F.2 contains the standardized coefficients and the squared partial and semipartial correlation coefficients for each predictor. For lasso, $\lambda = 0.0155$ offered the fewest predictors while increasing the model fit by at least 0.01.


Appendix G. Experimental Details: Abstract Ground Graph Size


In this appendix, we provide additional details for the experiment in Section 7.1. The
goal  of  this  experiment  is  to  determine  which  factors  influence  the  size  of  abstract  ground 
graphs  because  the  computational  complexity  of  relational  d-separation  depends  on  their 
size.  Specifically,  we  show  here  the  results  of  running  log-linear  regression  to  predict  the 
size  of  abstract  ground  graphs  for  varying  schemas  and  models.  We  first  applied  lasso  for 
feature  selection  (Tibshirani,  1996)  to  minimize  the  number  of  predictors  while  maximizing 
Predictor                                      Coefficient   Partial   Semipartial
# relationships                                       3.24     0.452         0.150
# MANY cardinalities × isEntity=F                     3.09     0.349         0.109
# entities                                           -2.11     0.359         0.102
# MANY cardinalities × isEntity=T                     2.51     0.216         0.053
# MANY cardinalities × # relationships               -0.88     0.100         0.020
# attributes                                          0.23     0.024         0.004

Table G.1: Number of nodes in an abstract ground graph: estimated standardized coefficient, squared partial correlation coefficient, and squared semipartial correlation coefficient for each predictor.


Predictor                                      Coefficient   Partial   Semipartial
log(# dependencies)                                   1.44     0.440         0.165
# relationships                                       3.86     0.395         0.138
# MANY cardinalities × isEntity=F                     4.27     0.356         0.123
# entities                                           -2.78     0.353         0.115
# MANY cardinalities × isEntity=T                     3.52     0.231         0.067
# MANY cardinalities × # relationships               -1.35     0.127         0.031

Table G.2: Number of edges in an abstract ground graph: estimated standardized coefficient, squared partial correlation coefficient, and squared semipartial correlation coefficient for each predictor.


model  fit.  We  also  standardized  the  input  variables  by  dividing  by  two  standard  deviations, 
as  recommended  by  Gelman  (2008).  Since  the  predictor  for  the  number  of  dependencies  is 
log-transformed,  the  standardization  for  that  variable  occurs  after  taking  the  logarithm. 

In  predicting  the  (log  of  the)  number  of  nodes,  the  following  variables  were  significantly 
and  substantively  predictive  (in  order  of  decreasing  predictive  power): 


•  Number  of  relationships  (positive) 

•  Interaction  between  many  cardinalities  and  an  indicator  variable  for  whether  the 
abstract  ground  graph  is  from  an  entity  or  relationship  perspective  (positive) 

•  Number  of  entities  (negative) 

•  Interaction  between  the  number  of  MANY  cardinalities  and  relationships  (negative) 

•  Total  number  of  attributes  (positive) 


The fit for the nodes model has an $R^2 = 0.818$ for n = 450,000, and Table G.1 contains the standardized coefficients as well as the squared partial and semipartial correlation coefficients for each predictor. For lasso, $\lambda = 0.0095$ offered the fewest predictors while increasing the model fit by at least 0.01.

In  predicting  the  (log  of  the)  number  of  edges,  the  following  variables  were  significantly 
and  substantively  predictive  (in  order  of  decreasing  predictive  power): 


•  Log  of  the  number  of  dependencies  (positive) 
•  Number of relationships (positive)

•  Interaction  between  many  cardinalities  and  an  indicator  variable  for  whether  the 
abstract  ground  graph  is  from  an  entity  or  relationship  perspective  (positive) 

•  Number  of  entities  (negative) 

•  Interaction  between  the  number  of  MANY  cardinalities  and  relationships  (negative) 


The fit for the edges model has an $R^2 = 0.789$ for n = 450,000, and Table G.2 contains the standardized coefficients and the squared partial and semipartial correlation coefficients for each predictor. For lasso, $\lambda = 0.0164$ offered the fewest predictors while increasing the model fit by at least 0.01.